403 Forbidden on site with urllib3 - python

So I am working on a project crawling different sites. All sites work except for caesarscasino.com: no matter what I try, I get a 403 Forbidden error. I have searched here and elsewhere to no avail.
Here is my code:
import urllib3
import urllib.request, urllib.error
from urllib.request import Request
import ssl

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen

# Override the default context factory so HTTPS certificates are not verified.
ssl._create_default_https_context = ssl._create_unverified_context
urllib3.disable_warnings()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

url = 'https://www.caesarscasino.com/'
req = Request(url, headers=headers)  # build the request (this does not open the URL yet)
result = urllib.request.urlopen(req).read()
print(result)
With this error code:
Traceback (most recent call last):
File "C:\Users\sp\Desktop\untitled0.py", line 30, in <module>
result = urllib.request.urlopen(req).read()
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\sp\anaconda3\envs\spyder\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Forbidden

The thing with scraping the web is that a lot of sites do not like being scraped, so they refuse to serve a machine (which your scraper is). That is the error you are getting: it basically means "do not access this site if you are a program". However, there are ways around that, like routing requests through different IP addresses and rotating headers while your program checks out the site. I already answered a question on how to do so here. Check it out and let me know in the comments whether or not that works for you.
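To illustrate the header-rotation part, here is a minimal sketch that picks a different User-Agent per request (the User-Agent strings are just sample values, and rotation alone may not get past dedicated bot protection):

import random
import urllib.request

# A small pool of browser User-Agent strings (sample values; substitute current ones).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
]

def fetch(url):
    # Send a different User-Agent on every request.
    req = urllib.request.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
    return urllib.request.urlopen(req).read()

html = fetch('https://www.caesarscasino.com/')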

I believe your issue is related to the fact that the site is HTTPS. See here for info on how to fix that.
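If it is the HTTPS layer, a minimal sketch with urllib3 (which the question already imports but never uses) that skips certificate verification; note that cert_reqs='CERT_NONE' disables validation entirely, so treat this as a debugging aid, not a fix:

import urllib3

urllib3.disable_warnings()  # silence the InsecureRequestWarning this setup triggers

# A pool manager that skips certificate verification (debugging only).
http = urllib3.PoolManager(cert_reqs='CERT_NONE')
resp = http.request('GET', 'https://www.caesarscasino.com/',
                    headers={'User-Agent': 'Mozilla/5.0'})
print(resp.status)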

Related

Getting a website with urllib results in HTTP 405 error

I'm learning BeautifulSoup and was trying to write a small script to find houses on a Dutch real-estate website. When I try to get the website's content, I immediately get an HTTP 405 error:
File "funda.py", line 2, in <module>
html = urlopen("http://www.funda.nl")
File "<folders>request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "<folders>request.py", line 532, in open
response = meth(req, response)
File "<folders>request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "<folders>request.py", line 570, in error
return self._call_chain(*args)
File "<folders>request.py", line 504, in _call_chain
result = func(*args)
File "<folders>request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed
What I'm trying to execute:
from urllib.request import urlopen
html = urlopen("http://www.funda.nl")
Any idea why this results in HTTP 405? I'm just doing a GET request, right?
Possible duplicate of HTTPError: HTTP Error 403: Forbidden. You need to pretend to be a regular visitor. This is generally done (the details vary from site to site) by sending a common, regular User-Agent HTTP header.
>>> url = "http://www.funda.nl"
>>> import urllib.request
>>> req = urllib.request.Request(
... url,
... data=None,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> f = urllib.request.urlopen(req)
>>> f.status, f.msg
(200, 'OK')
Using the requests library:
>>> import requests
>>> response = requests.get(
... url,
... headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
... }
... )
>>> response.status_code
200
It works if you don't use Requests or urllib2 (note this is Python 2's urllib; in Python 3, urlopen lives in urllib.request):
import urllib
html = urllib.urlopen("http://www.funda.nl")
leovp's comment makes sense.

open url from pythonanywhere

This code works well on my local machine, but when I upload and run it on pythonanywhere.com it gives me this error.
My Code:
url = "http://www.codeforces.com/api/contest.list?gym=false"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
openedReq = opener.open(req, timeout=300)
The error:
Traceback (most recent call last):
File "/home/GehadAbdallah/main.py", line 135, in openApi
openedReq = opener.open(req, timeout=300)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
P.S. I'm working on Python 2.7.
Free accounts on PythonAnywhere are restricted to a whitelist of sites, http/https only, and access goes via a proxy. There's more info here:
PythonAnywhere wiki: "why do I get a 403 forbidden error when opening a url?"
I recently used urllib2 in a Flask project on a PythonAnywhere free account to access an API at donorschoose.org. This might be helpful:
import json
import urllib2

@app.route('/funding')
def fundingByState():
    # Route requests through the PythonAnywhere proxy (required on free accounts).
    urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': 'proxy.server:3128'})))
    donors_choose_url = "http://api.donorschoose.org/common/json_feed.html?historical=true&APIKey=DONORSCHOOSE"
    response = urllib2.urlopen(donors_choose_url)
    json_response = json.load(response)
    return json.dumps(json_response)
This does work.
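If you are on Python 3, a rough equivalent of the snippet above using urllib.request (assuming the same proxy.server:3128 proxy that PythonAnywhere documents):

import json
import urllib.request

# Route both schemes through the PythonAnywhere proxy (free accounts).
proxy = urllib.request.ProxyHandler({'http': 'proxy.server:3128',
                                     'https': 'proxy.server:3128'})
urllib.request.install_opener(urllib.request.build_opener(proxy))

url = "http://www.codeforces.com/api/contest.list?gym=false"
with urllib.request.urlopen(url, timeout=300) as response:
    data = json.loads(response.read().decode('utf-8'))
print(data.get('status'))  # the Codeforces API wraps results in a status field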
If you're using a paid account but still get this error message, try the PythonAnywhere forums. For me, I had to delete the console and then start a new one.

Python login to my cpanel with python

I want to create a script with Python that tests whether my combination (username and password) is correct, but I always get a 401 HTTP response, so I think the script can't submit the login data. (The cPanel login isn't a traditional login panel, so I will use the demo login panel as our example-site.com):
import urllib, urllib2, os, sys, re

site = 'http://cpanel.demo.cpanel.net/'
username = 'demo'
password = 'demo'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7"}

data = [
    ("user", username),
    ("pass", password),
    ("testcookie", 1),
    ("submit", "Log In"),
    ("redirect_to", 'http://cpanel.demo.cpanel.net/'),
    ("rememberme", "forever")]

req = urllib2.Request(site, urllib.urlencode(dict(data)), dict(headers))
response = urllib2.urlopen(req)

if any('index.html' in v for v in response.headers.values()):
    print "Correct login"
else:
    print "incorrect"
I get this error:
Traceback (most recent call last):
File "C:\Python27\cp\cp4.py", line 19, in <module>
response = urllib2.urlopen(req)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Access Denied
Any ideas on how to solve the problem and test the login details?
You need to POST the login info to http://cpanel.demo.cpanel.net/login/
Example using requests (much easier!):
import requests

logininfo = {'user': 'demo', 'pass': 'demo'}
r = requests.post("http://cpanel.demo.cpanel.net/login/", data=logininfo)
if r.status_code == 200:
    print "Correct login"
Consider using Requests, a much more user-friendly HTTP client library for Python.
import requests

url = 'http://cpanel.demo.cpanel.net/login/'
username = 'demo'
password = 'demo'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
}
data = {
    'user': username,
    'pass': password,
}

response = requests.post(url, headers=headers, data=data)

if response.status_code == 200:
    print "Successfully logged in as {username}".format(username=username)
else:
    print "Login unsuccessful: HTTP/{status_code}".format(status_code=response.status_code)
Edited to check for HTTP/200, as CPanel does throw an HTTP/401 if the login is incorrect.
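If you would rather stay with urllib2, a minimal sketch of the same check; urllib2 raises HTTPError on the 401, so the failure case is handled in an except block rather than by inspecting a status code:

import urllib, urllib2

url = 'http://cpanel.demo.cpanel.net/login/'
body = urllib.urlencode({'user': 'demo', 'pass': 'demo'})

try:
    urllib2.urlopen(urllib2.Request(url, body))
    print "Correct login"                        # got HTTP 200
except urllib2.HTTPError as e:
    print "Incorrect login: HTTP %d" % e.code    # cPanel returns 401 on bad credentials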

python timer + urllib2 code errors

I am trying to pull information from a site every 5 seconds, but it doesn't seem to be working, and I get errors every time I run it.
Code below:
import urllib2, threading

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/'[1])
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
I get these errors:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 808, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 1080, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\Jordan\Desktop\username.py", line 3, in readpage
data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Help would be appreciated, thanks!
The site is rejecting the default User-Agent reported by urllib2. You can change it for all requests in the script using install_opener.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)
You'll also need to split the data returned by the site into lines in order to read it line by line:
urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
and change
line.split('/runescape-2007-prices/player/'[1])
to
line.split('/runescape-2007-prices/player/')[1]
Working:
import urllib2, threading

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
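Note that threading.Timer fires only once, so the code above fetches the page a single time, 5 seconds after starting. To actually poll every 5 seconds, re-arm the timer at the end of readpage; a minimal sketch (assuming the User-Agent opener above has been installed):

import urllib2, threading

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])
    # Schedule the next fetch; without this the function runs only once.
    threading.Timer(5.0, readpage).start()

readpage()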
Did you try opening that URL without the thread? The error code says 403: Forbidden; maybe you need authentication for that web page.
This has nothing to do with Python -- the server is denying your requests to that URL.
I suspect that either the URL is incorrect or you've hit some kind of rate limiting and are being blocked.
EDIT: how to make it work
The site is blocking Python's User-Agent. Try this:
import urllib2, threading

def readpage():
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request('http://forums.zybez.net/runescape-2007-prices', None, headers)
    # Read the response and split it into lines.
    data = urllib2.urlopen(req).read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])
