Thanks for reading. For a small research project, I'm trying to gather some data from KBB (www.kbb.com). However, I always get a "urllib.error.HTTPError: HTTP Error 400: Bad Request" error. Other websites seem to work fine with this simple piece of code, so I'm not sure whether this is an issue with my code or with this specific website.
Maybe someone can point me in the right direction.
from urllib import request as urlrequest
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
req = urlrequest.Request(url)
req.set_proxy(proxy_host, 'https')
page = urlrequest.urlopen(req)
print(page)
There are two issues here but one solution, as I found out below:
1. The proxy server is refusing the connection.
2. Without the proxy, you need authentication for the server; in every case it responds with 403 Forbidden.
Using urllib
from urllib import request as urlrequest
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
req = urlrequest.Request(url)
# req.set_proxy(proxy_host, 'https')
page = urlrequest.urlopen(req)
print(page)
> urllib.error.HTTPError: HTTP Error 403: Forbidden
Using Requests
import requests
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
res = requests.get(url)
print(res)
# >>> <Response [403]>
Using Postman
Edit: Solution
Setting a slightly longer timeout makes it work. However, I had to retry several times, because the proxy sometimes just doesn't respond.
import urllib.request
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
proxy_support = urllib.request.ProxyHandler({'https' : proxy_host})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
res = urllib.request.urlopen(url, timeout=1000)  # Set a long timeout (in seconds) so the proxy has time to respond
print(res.read())
Result
b'<!doctype html><html lang="en"><head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=5,minimum-scale=1"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch preconnect" href="//securepubads.g.doubleclick.net" crossorigin><link rel="dns-prefetch preconnect" href="//c.amazon-adsystem.com" crossorigin><link .........
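Because the proxy sometimes never responds, wrapping the call in a small retry loop helps; this is my own sketch (the retry count and pause are arbitrary), not part of the solution above:
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, retries=5, timeout=120):
    # The free proxy often drops requests outright, so retry a few times.
    for attempt in range(1, retries + 1):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except (urllib.error.URLError, TimeoutError) as exc:
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            time.sleep(2)  # brief pause before retrying
    raise RuntimeError(f"All {retries} attempts failed for {url}")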
Using Requests
import requests
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
# NOTE: we need a longer timeout for the proxy to respond, and verify=False to avoid an SSL error
r = requests.get(url, proxies={"https": proxy_host}, timeout=90, verify=False)  # timeout is in seconds
print(r.text)
Your code appears to work fine without the set_proxy statement, so I think it is most likely your proxy server that is rejecting the request, rather than KBB.
Related
I've been looking through the Python Requests documentation, but I cannot see any functionality for what I am trying to achieve.
In my script I am setting allow_redirects=True.
I would like to know, if the page has been redirected to something else, what the new URL is.
For example, if the start URL was: www.google.com/redirect
And the final URL is www.google.co.uk/redirected
How do I get that URL?
You are looking for the request history.
The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.
response = requests.get(someurl)
if response.history:
    print("Request was redirected")
    for resp in response.history:
        print(resp.status_code, resp.url)
    print("Final destination:")
    print(response.status_code, response.url)
else:
    print("Request was not redirected")
Demo:
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
... print(resp.status_code, resp.url)
...
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get
This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.
If you want to use allow_redirects=False and get directly to the first redirect object, rather than following a chain of them, and you just want to get the redirect location directly out of the 302 response object, then r.url won't work. Instead, it's the "Location" header:
r = requests.get('http://github.com/', allow_redirects=False)
r.status_code # 302
r.url # http://github.com, not https.
r.headers['Location'] # https://github.com/ -- the redirect destination
I think requests.head instead of requests.get is safer to call when handling URL redirects. Check the GitHub issue here:
r = requests.head(url, allow_redirects=True)
print(r.url)
The documentation has this blurb: https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
import requests
r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for
For Python 3.5, you can use the following code:
import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)
I wrote the following function to get the full URL from a short URL (bit.ly, t.co, ...)
import requests

def expand_short_url(url):
    r = requests.head(url, allow_redirects=False)
    r.raise_for_status()
    if 300 < r.status_code < 400:
        url = r.headers.get('Location', url)
    return url
Usage (short URL is this question's url):
short_url = 'https://tinyurl.com/' + '4d4ytpbx'
full_url = expand_short_url(short_url)
print(full_url)
Output:
https://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url
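One caveat: some servers return a relative Location header. A variant that resolves it against the requested URL (my own sketch, using urllib.parse.urljoin) would be:
import requests
from urllib.parse import urljoin

def expand_short_url_relative(url):
    # Like expand_short_url, but resolve relative Location headers
    # against the URL that was requested.
    r = requests.head(url, allow_redirects=False)
    r.raise_for_status()
    if 300 <= r.status_code < 400:
        url = urljoin(url, r.headers.get('Location', url))
    return url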
I wasn't able to use the requests library and had to go a different way. Here is the code that I posted as a solution to this post (To get redirected URL with requests).
This way you actually open the browser, wait for it to log the URL in its history, and then read the last URL from that history. I wrote this code for Google Chrome, but you should be able to follow along if you are using a different browser.
import webbrowser
import sqlite3
import time
import pandas as pd
import shutil

webbrowser.open("https://twitter.com/i/user/2274951674")

# The source file is where your browser's history is saved. I was using Chrome,
# but it should be the same process for a different browser.
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# I could not connect to the history file directly, as it was locked,
# so I had to make a copy of it in a different location.
destination_file = 'C:\\Users\\{user}\\Downloads\\History'

time.sleep(30)  # the history file takes a moment to update; 30 seconds gives it enough time to log the last URL
shutil.copy(source_file, destination_file)  # copy the file

con = sqlite3.connect(destination_file)  # connect to the copied browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()

df_history = pd.DataFrame(urls, columns=names)
last_url = df_history.loc[len(df_history) - 1, 'url']
print(last_url)
>>https://twitter.com/ozanbayram01
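As a side note, rather than relying on the row order of the urls table, you could sort by the visit timestamp; a minimal sketch, assuming Chrome's standard last_visit_time column:
con = sqlite3.connect(destination_file)
row = con.execute(
    "SELECT url FROM urls ORDER BY last_visit_time DESC LIMIT 1"
).fetchone()  # most recently visited URL
con.close()
print(row[0])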
All the answers above are applicable where the final URL exists and works fine.
In case the final URL doesn't seem to work, below is a way to capture all the redirects.
I had a scenario where the final URL wasn't working anymore, and other ways, like reading the URL history, gave errors.
Code Snippet
import requests

long_url = ''
url = 'http://example.com/bla-bla'
try:
    while True:
        # Each hop returns a 'location' header; the loop ends with a
        # KeyError once a response has no more redirects to follow.
        long_url = requests.head(url).headers['location']
        print(long_url)
        url = long_url
except KeyError:
    print(long_url)
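If the chain could contain a redirect loop, a bounded variant of the same idea avoids spinning forever; this is my own sketch, and max_hops is an arbitrary cap:
import requests

def follow_redirects(url, max_hops=10):
    # Walk the Location headers manually, stopping at the first
    # non-redirect response or after max_hops hops.
    for _ in range(max_hops):
        location = requests.head(url).headers.get('location')
        if location is None:
            break
        print(location)
        url = location
    return url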
I am new at this, but I am trying to scrape data from a website that requires a login, and I am getting an error when trying to open it. It appears that the problem is in the cookies, in that they are not being properly stored?
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
from http.cookiejar import CookieJar
import urllib
username = 'xxx'
password = 'xxx'
values = {'email': username, 'password': password}
session = requests.session()
login_url = 'https://login.aripaev.ee/Account/Login?ReturnUrl=%2fOAuth%2fAuthorize%3fclient_id%3dinfopank%26redirect_uri%3dhttps%253A%252F%252Finfopank.ee%252FAccount%252FLogin%253FreturnUrl%253D%25252F%2526returnAsRedirect%253DFalse%26state%3dLjNuwARtELJnVPcF8ka2Jg%26scope%3d%252FUserDataService%252Fjson%252FProfile%2520%252FUserDataService%252Fjson%252FPermissions%2520%252FUserDataService%252Fjson%252FOrders%2520%252FUserDataService%252Fv2%252Fjson%252FProfile%2520%252FUserDataService%252Fv2%252Fjson%252FPermissions%2520%252FUserDataService%252Fv2%252Fjson%252FOrders%26response_type%3dcode&client_id=infopank&redirect_uri=https%3A%2F%2Finfopank.ee%2FAccount%2FLogin%3FreturnUrl%3D%252F%26returnAsRedirect%3DFalse&state=LjNuwARtELJnVPcF8ka2Jg&scope=%2FUserDataService%2Fjson%2FProfile%20%2FUserDataService%2Fjson%2FPermissions%20%2FUserDataService%2Fjson%2FOrders%20%2FUserDataService%2Fv2%2Fjson%2FProfile%20%2FUserDataService%2Fv2%2Fjson%2FPermissions%20%2FUserDataService%2Fv2%2Fjson%2FOrders&response_type=code'
url = 'https://infopank.ee/ettevote/1/'
result = session.get(login_url)
result = session.post(login_url, data = values, headers = dict(referer=login_url))
cookieProcessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(cookieProcessor)
page = urlopen(url)
Error message:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Any suggestions are welcome - thanks!
Don't mix urllib.request with requests. If you are going to use requests, it will just work fine.
Remove these lines from your program:
from urllib.request import urlopen
from http.cookiejar import CookieJar
import urllib
cookieProcessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(cookieProcessor)
page = urlopen(url)
This code has two issues: it doesn't have the cookies that were stored in the requests session, and the call to urlopen uses the default opener, which has no cookie support at all (opener.open should have been used instead).
Replace this with:
page = session.get(url)
Then the requests.session keeps track of the cookies for you.
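Putting it together, a minimal corrected flow might look like this (a sketch reusing the question's username, password, login_url, and url variables):
import requests

values = {'email': username, 'password': password}
session = requests.session()

# One session object carries the cookies across all three requests.
session.get(login_url)  # the initial GET picks up the session cookies
session.post(login_url, data=values, headers=dict(referer=login_url))
page = session.get(url)  # the stored cookies are sent automatically
print(page.status_code)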
As in this post, I attempt to get the final redirect of a webpage as:
import urllib.request
response = urllib.request.urlopen(url)
response.geturl()
But this doesn't work as I get the "HTTPError: HTTP Error 300: Multiple Choices" error when attempting to use urlopen.
See the documentation for these methods here.
EDIT:
This problem is different from the Python: urllib2.HTTPError: HTTP Error 300: Multiple Choices question, because there they skip the error-causing pages, while I have to obtain the final destination.
As suggested by @abccd, I used the requests library, so I will describe the solution.
import requests

url_base = 'something'  # You need this because the redirect URL is relative.
url = url_base + 'somethingelse'
response = requests.get(url)
# Check if the request returned the 300 status code.
if response.status_code == 300:
    redirect_url = url_base + response.headers['Location']  # Get the new URL.
    response = requests.get(redirect_url)  # Make a new request.
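If the server can chain several 300 responses, the same pattern can be wrapped in a loop; a sketch under that assumption (requests does not auto-follow status 300, so each hop is explicit):
import requests

def resolve_multiple_choices(url, url_base, max_hops=5):
    # Follow successive 300 Multiple Choices responses manually.
    response = requests.get(url)
    for _ in range(max_hops):
        if response.status_code != 300:
            break
        url = url_base + response.headers['Location']  # relative redirect
        response = requests.get(url)
    return response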