Getting a website with urllib results in HTTP 405 error - python

I'm learning BeautifulSoup and was trying to write a small script to find houses on a Dutch real estate website. When I try to get the website's content, I immediately get an HTTP 405 error:
File "funda.py", line 2, in <module>
html = urlopen("http://www.funda.nl")
File "<folders>request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "<folders>request.py", line 532, in open
response = meth(req, response)
File "<folders>request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "<folders>request.py", line 570, in error
return self._call_chain(*args)
File "<folders>request.py", line 504, in _call_chain
result = func(*args)
File "<folders>request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed
What I'm trying to execute:
from urllib.request import urlopen
html = urlopen("http://www.funda.nl")
Any idea why this results in HTTP 405? I'm just doing a GET request, right?

Possible duplicate of HTTPError: HTTP Error 403: Forbidden. You need to pretend to be a regular visitor. This is generally done (the details vary from site to site) by sending a common browser User-Agent HTTP header.
>>> url = "http://www.funda.nl"
>>> import urllib.request
>>> req = urllib.request.Request(
... url,
... data=None,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> f = urllib.request.urlopen(req)
>>> f.status, f.msg
(200, 'OK')
Using the requests library:
>>> import requests
>>> response = requests.get(
... url,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> response.status_code
200
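Since the goal is to feed the page into BeautifulSoup, here is a minimal end-to-end sketch building on the requests call above (assuming the bs4 package is installed; the link extraction at the end is purely illustrative, as the real selectors depend on funda.nl's markup):
import requests
from bs4 import BeautifulSoup

url = "http://www.funda.nl"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/35.0.1916.47 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Print every link's href, just to prove the page was fetched and parsed.
for link in soup.find_all("a", href=True):
    print(link["href"])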

It works if you use Python 2's plain urllib instead of Requests or urllib2:
import urllib
html = urllib.urlopen("http://www.funda.nl")
(Note that urllib.urlopen exists only in Python 2; it was removed in Python 3. It also sends a different default User-Agent string than urllib2 does, which is presumably why the site lets it through.) leovp's comment makes sense.
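For what it's worth, the Python 3 way of changing the default User-Agent once, rather than per request, is to set it on an opener; a minimal sketch:
import urllib.request

# Replace the opener's default headers (normally 'Python-urllib/3.x')
# with a browser-style User-Agent.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/35.0.1916.47 Safari/537.36')]
urllib.request.install_opener(opener)  # plain urlopen() now uses it too

html = urllib.request.urlopen("http://www.funda.nl")
print(html.status)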

Related

403 Forbidden on site with urllib3

I am working on a project crawling different sites. All sites work except for caesarscasino.com.
No matter what I try, I get a 403 Forbidden error. I have searched here and elsewhere to no avail.
Here is my code:
import urllib3
import urllib.request, urllib.error
from urllib.request import Request
import ssl

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Override the default HTTPS context factory so certificates are not verified.
ssl._create_default_https_context = ssl._create_unverified_context
urllib3.disable_warnings()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

url = 'https://www.caesarscasino.com/'
req = Request(url, headers=headers)  # builds the request (not opened yet)
result = urllib.request.urlopen(req).read()
print(result)
With this error code:
Traceback (most recent call last):
File "C:\Users\sp\Desktop\untitled0.py", line 30, in <module>
result = urllib.request.urlopen(req).read()
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Forbidden
The thing with scraping the web is that not a lot of sites like being scraped, so they do not allow a machine (which your scraper is) to access the page. That is the error you are getting. It basically means: do not access this site with a program. However, there are ways around that, such as spoofing the IP address and rotating headers while your program checks out the site. I already answered the question of how to do so here. Check it out and let me know in the comments whether or not that works for you.
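The linked answer has the full details; as a rough illustration of the header-rotation part only (my own sketch, with an example pool of User-Agent strings), each request picks a random browser identity:
import random
import requests

# Example pool of real-browser User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
]

def fetch(url):
    # Each call sends a different, randomly chosen User-Agent.
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

response = fetch('https://www.caesarscasino.com/')
print(response.status_code)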
I believe your issues are related to the fact that it's https. See here for info on how to fix that.
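If HTTPS really is part of the problem, one option (a sketch of my own, not the linked page's code) is to pass an explicit SSL context to urlopen instead of monkey-patching the module-wide default:
import ssl
import urllib.request

url = 'https://www.caesarscasino.com/'
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'})

# create_default_context() keeps certificate verification enabled, which is
# the safer choice; only fall back to an unverified context if you must.
context = ssl.create_default_context()
result = urllib.request.urlopen(req, context=context).read()
print(result[:200])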

Python's requests library timing out but getting the response from the browser

I am trying to create a web scraper for NBA data. When I run the code below:
import requests
response = requests.get('https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=')
the request times out with this error:
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py",
line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py",
line 56, in request
return session.request(method=method, url=url, **kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py",
line 488, in request
resp = self.send(prep, **send_kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py",
line 609, in send
r = adapter.send(request, **kwargs)
File
"C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py",
line 473, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060,
'WSAETIMEDOUT')",))
However, when I hit the same URL in the browser, I am getting a response.
Looks like the website you mentioned checks for a "User-Agent" in the request's headers. You can fake the "User-Agent" in your request to make it look like it is coming from an actual browser, and you'll receive the response.
For example:
import requests
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=10%2F20%2F2017&DateTo=10%2F20%2F2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
# it's the user-agent of my browser ^
response = requests.get(url, headers=headers)
response.status_code # will return: 200
response.text # will return the website content
You can find the user-agent of your browser from here.
If it's still not working, use this header:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36','Accept-Encoding': 'gzip, deflate, br','Accept-Language': 'en-US,en;q=0.9,hi;q=0.8'}
If the other headers do not work, try this header; it worked pretty well for me.
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15","Accept-Language": "en-gb","Accept-Encoding":"br, gzip, deflate","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Referer":"http://www.google.com/"}
I collected these headers from this link.
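Since the original symptom was a hang, it also helps to pass an explicit timeout so the call fails fast, and optionally to retry transient failures. A sketch (the retry numbers are arbitrary, and the URL is abbreviated; reuse the full query string from the question):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times, with backoff, on connection errors and 429/5xx responses.
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/61.0.3163.100 Safari/537.36'}

# timeout=10 makes the call raise instead of hanging indefinitely.
response = session.get('https://stats.nba.com/stats/leaguedashplayerstats',
                       headers=headers, timeout=10)
print(response.status_code)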

open url from pythonanywhere

This code works well on my local machine, but when I upload and run it on pythonanywhere.com it gives me this error.
My Code:
url = "http://www.codeforces.com/api/contest.list?gym=false"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
openedReq = opener.open(req, timeout=300)
The error:
Traceback (most recent call last):
File "/home/GehadAbdallah/main.py", line 135, in openApi
openedReq = opener.open(req, timeout=300)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
P.S. I'm working on Python 2.7.
Free accounts on PythonAnywhere are restricted to a whitelist of sites, http/https only, and access goes via a proxy. There's more info here:
PythonAnywhere wiki: "why do I get a 403 forbidden error when opening a url?"
I recently used urllib2 in a Flask project on PythonAnywhere (free account) to access an API at donorschoose.org.
This might be helpful:
@app.route('/funding')
def fundingByState():
    # Route all traffic through PythonAnywhere's proxy (required on free accounts).
    urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': 'proxy.server:3128'})))
    donors_choose_url = "http://api.donorschoose.org/common/json_feed.html?historical=true&APIKey=DONORSCHOOSE"
    response = urllib2.urlopen(donors_choose_url)
    json_response = json.load(response)
    return json.dumps(json_response)
This does work.
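For comparison, the same proxy setup with the requests library might look like this (a sketch; proxy.server:3128 is the proxy PythonAnywhere documents, and the request still only succeeds for sites on the free-tier whitelist):
import requests

# PythonAnywhere free accounts must route outbound traffic through this proxy.
proxies = {
    'http': 'http://proxy.server:3128',
    'https': 'http://proxy.server:3128',
}
response = requests.get('http://www.codeforces.com/api/contest.list?gym=false',
                        proxies=proxies, timeout=300)
print(response.status_code)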
If you're using a paid account but still get this error message, try this: pythonanywhere_forums.
For me, I had to delete the console and start a new one.

Python login to my cpanel with python

I want to create a script with Python that tests whether my combination (username and password) is correct, but I always get a 401 HTTP response, so I think the script can't submit the login data. (The cPanel login isn't a traditional login form, so I will use the demo login panel as our example site):
import urllib, urllib2, os, sys, re

site = 'http://cpanel.demo.cpanel.net/'
username = 'demo'
password = 'demo'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7"}

data = [
    ("user", username),
    ("pass", password),
    ("testcookie", 1),
    ("submit", "Log In"),
    ("redirect_to", 'http://cpanel.demo.cpanel.net/'),
    ("rememberme", "forever")]

req = urllib2.Request(site, urllib.urlencode(dict(data)), dict(headers))
response = urllib2.urlopen(req)

if any('index.html' in v for v in response.headers.values()):
    print "Correct login"
else:
    print "incorrect"
I get this error:
Traceback (most recent call last):
File "C:\Python27\cp\cp4.py", line 19, in <module>
response = urllib2.urlopen(req)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Access Denied
Any ideas on how to solve the problem and test the login details?
You need to POST the login info to http://cpanel.demo.cpanel.net/login/
Example using requests (much easier!):
import requests

logininfo = {'user': 'demo', 'pass': 'demo'}
r = requests.post("http://cpanel.demo.cpanel.net/login/", data=logininfo)
if r.status_code == 200:
    print "Correct login"
Consider using Requests, a much more user-friendly HTTP client library for Python.
import requests

url = 'http://cpanel.demo.cpanel.net/login/'
username = 'demo'
password = 'demo'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
}
data = {
    'user': username,
    'pass': password,
}

response = requests.post(url, headers=headers, data=data)
if response.status_code == 200:
    print "Successfully logged in as {username}".format(username=username)
else:
    print "Login unsuccessful: HTTP/{status_code}".format(status_code=response.status_code)
Edited to check for HTTP/200, as CPanel does throw an HTTP/401 if the login is incorrect.
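If you then want to make authenticated follow-up requests, a requests.Session keeps the cookies from the login response. A minimal sketch using the same demo credentials and login path as the answers above:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'})

# POST the credentials; any cookies the server sets are stored on the session
# and sent automatically with every later request made through it.
login = session.post('http://cpanel.demo.cpanel.net/login/',
                     data={'user': 'demo', 'pass': 'demo'})

if login.status_code == 200:
    print('Correct login')
else:
    print('Login unsuccessful: HTTP/{0}'.format(login.status_code))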
