Redirected URL in python - python

I'm trying to get the redirected URL, but something doesn't work.
I tried two methods:
from urllib import request as uReq
import requests
# method 1
url_str = 'http://google.us/'
resp = uReq.urlopen(url_str)  # use the alias that was actually imported
print(resp.geturl())
# method 2
url_str = "http://google.us/"
resp = requests.get(url_str)
print(resp.url)
Both work and print https://www.google.com
However, when I use this URL as url_str, nothing happens: http://www.kontrakt.szczecin.pl/lista-ofert/?f_listingId=351238&f=&submit=Szukaj. When you open it in a browser, you end up at this link:
http://www.kontrakt.szczecin.pl/mieszkanie-wynajem-41m2-1850pln-janusza-kusocinskiego-centrum-szczecin-zachodniopomorskie,351238
It is important for me to get that link, because I need information from it.

With allow_redirects=False you can make the request stay on the URL you asked for, even if the server intends to redirect it.
resp = requests.get(url_str, allow_redirects=False)
You can find more such usage here

Related

How do I get the URL of a redirect using Python [duplicate]

I've been looking through the Python Requests documentation but I cannot see any functionality for what I am trying to achieve.
In my script I am setting allow_redirects=True.
I would like to know if the page has been redirected to something else, what is the new URL.
For example, if the start URL was: www.google.com/redirect
And the final URL is www.google.co.uk/redirected
How do I get that URL?
You are looking for the request history.
The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.
response = requests.get(someurl)
if response.history:
    print("Request was redirected")
    for resp in response.history:
        print(resp.status_code, resp.url)
    print("Final destination:")
    print(response.status_code, response.url)
else:
    print("Request was not redirected")
Demo:
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
...     print(resp.status_code, resp.url)
...
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get
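The same history walk can be wrapped in a small helper that returns the hops instead of printing them. This is only a sketch: redirect_chain is a hypothetical name, and it relies only on the history, status_code and url attributes described above, so the stand-in objects below (made-up values, no network access) are enough to show the shape of the result.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in for a requests.Response, built by hand for illustration only.
fake = SimpleNamespace(
    status_code=200,
    url="http://example.com/final",
    history=[SimpleNamespace(status_code=302, url="http://example.com/start")],
)

for status, url in redirect_chain(fake):
    print(status, url)
```

With a real response from requests.get(someurl), the last tuple is the final destination and every earlier tuple is one redirect.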
This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.
If you use allow_redirects=False and want the target of the first redirect straight from the 302 response, rather than following the whole chain, then r.url won't work: it still holds the URL you requested. The target is in the "Location" header instead:
r = requests.get('http://github.com/', allow_redirects=False)
r.status_code # 302
r.url # http://github.com, not https.
r.headers['Location'] # https://github.com/ -- the redirect destination
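One caveat with reading the header directly: a Location value may be a relative reference such as /login, in which case it has to be resolved against the URL that was requested. A minimal sketch using only the standard library; absolute_redirect_target is a hypothetical helper name and the example.com URLs are made up.

```python
from urllib.parse import urljoin


def absolute_redirect_target(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    if not location:
        return None  # no Location header, so there is no redirect target
    return urljoin(request_url, location)


# An absolute Location is returned unchanged.
print(absolute_redirect_target("http://github.com/", "https://github.com/"))
# A relative Location is resolved against the requested URL.
print(absolute_redirect_target("http://example.com/a/b", "/login?next=1"))
```

With requests this would be called as absolute_redirect_target(r.url, r.headers.get('Location')).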
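I think requests.head is safer to call than requests.get when handling URL redirects. Note that requests.head does not follow redirects by default, so allow_redirects=True has to be passed explicitly. Check a GitHub issue here: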
r = requests.head(url, allow_redirects=True)
print(r.url)
The documentation has this blurb: https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
import requests
r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for
For Python 3.5, you can use the following code:
import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)
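urlopen follows redirects automatically, so geturl() only reports the final URL. If the first redirect target is wanted with the standard library alone, one possible approach is a redirect handler that refuses to follow. This is a sketch, not the only way to do it; NoRedirectHandler and first_redirect_target are hypothetical names.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built, so the
        # 3xx response surfaces as an HTTPError instead.
        return None


def first_redirect_target(url, timeout=10):
    """Return the Location header of the first response, or None if it is not a redirect."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    try:
        with opener.open(url, timeout=timeout):
            return None  # successful response, nothing was redirected
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return err.headers.get("Location")
        raise  # a real error such as 404 or 500
```

Calling first_redirect_target(starturl) then returns the raw Location value, which may be relative to starturl.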
I wrote the following function to get the full URL from a short URL (bit.ly, t.co, ...)
import requests
def expand_short_url(url):
    r = requests.head(url, allow_redirects=False)
    r.raise_for_status()
    if 300 < r.status_code < 400:
        url = r.headers.get('Location', url)
    return url
Usage (short URL is this question's url):
short_url = 'https://tinyurl.com/' + '4d4ytpbx'
full_url = expand_short_url(short_url)
print(full_url)
Output:
https://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url
I wasn't able to use the requests library and had to go a different way. Here is the code I am posting as a solution to this question (getting the redirected URL).
This way you actually open the browser, wait for it to log the URL in its history, and then read the last URL from that history. I wrote this code for Google Chrome, but you should be able to follow along if you are using a different browser.
import shutil
import sqlite3
import time
import webbrowser
import pandas as pd
webbrowser.open("https://twitter.com/i/user/2274951674")
# source_file is where the browser saves its history; this path is for Chrome, other browsers keep it elsewhere
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# the history file was locked, so I could not connect to it directly and had to copy it to a different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30)  # the history file is updated with some delay; 30 seconds gives the last URL time to be logged
shutil.copy(source_file, destination_file)  # copy the file
con = sqlite3.connect(destination_file)  # connect to the copied browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls, columns=names)
last_url = df_history.loc[len(df_history) - 1, 'url']
print(last_url)
>>https://twitter.com/ozanbayram01
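The last row of the urls table is not necessarily the most recently visited page, since a URL that was visited before keeps its old row. A hedged variant that lets SQLite pick the newest entry, without pandas, is sketched below. It assumes the urls table has a last_visit_time column, as Chrome's history database appears to; latest_history_url is a hypothetical name, and the path passed in should be the copied file, not the locked original.

```python
import sqlite3


def latest_history_url(history_copy_path):
    """Return the most recently visited URL from a copied history database, or None."""
    con = sqlite3.connect(history_copy_path)
    try:
        # Assumption: the urls table has url and last_visit_time columns.
        row = con.execute(
            "SELECT url FROM urls ORDER BY last_visit_time DESC LIMIT 1"
        ).fetchone()
    finally:
        con.close()
    return row[0] if row else None
```

It would be called with the copy made above, for example latest_history_url(destination_file).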
All the answers above apply when the final URL exists and works fine.
If the final URL doesn't work, below is a way to capture all the redirects anyway.
I had a scenario where the final URL wasn't working anymore and other approaches, such as the response history, raised an error.
Code snippet:
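import requests
long_url = ''
url = 'http://example.com/bla-bla'
try:
    while True:
        # a response without a Location header raises KeyError and ends the loop
        long_url = requests.head(url).headers['location']
        print(long_url)
        url = long_url
except Exception:
    # the last hop either was not a redirect or could not be reached
    print(long_url)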
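A loop like this can spin forever if two URLs redirect to each other. A bounded variant is sketched below under some assumptions: follow_redirects and its get_location parameter are hypothetical names, and get_location is any callable that returns the Location header for a URL, or None when there is none. The demo uses a made-up lookup table instead of the network.

```python
from urllib.parse import urljoin


def follow_redirects(url, get_location, max_hops=10):
    """Follow redirects one hop at a time and return every URL visited, in order."""
    visited = [url]
    for _ in range(max_hops):
        try:
            location = get_location(url)
        except Exception:
            break  # the hop failed (dead link, timeout, ...); keep what was collected
        if not location:
            break  # not a redirect, so this is the end of the chain
        url = urljoin(url, location)  # Location may be relative
        if url in visited:
            break  # redirect loop
        visited.append(url)
    return visited


# Made-up redirect table standing in for real HTTP responses.
table = {"http://a.example/": "/b", "http://a.example/b": "http://c.example/"}
print(follow_redirects("http://a.example/", table.get))
```

With requests, get_location could be lambda u: requests.head(u, allow_redirects=False, timeout=10).headers.get('Location').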

Python requests with redirection on Heroku

I'm making a GET request with the requests module of Python. The request returns a 302 code and redirects to another URL. Locally, I am able to capture that new URL with:
r = requests.get(URL)
finalURL = r.url
But when this code is executed in Heroku, the redirection is not carried out and the original url is returned to me.
I've tried everything, including forcing the redirection with this code:
r = requests.get(URL, allow_redirects=True)
I have also tried to pick up the url from the response headers, such as location or X-Originating-URL, but when the request is made from Heroku, the response does not return those values in the header.
Redirects are followed by default. Try looking at the content of the response: there has to be something in it that does the redirecting. You can also try fiddling with the request.

Using requests.get to get redirected url

I have a super long url, and I'm trying to print its final destination. I've got:
import requests
url = "https://l.facebook.com/l.php?u=https%3A%2F%2Fwww.washingtonpost.com%2Fscience%2F2020%2F04%2F29%2Fcoronavirus-detection-dogs%2F%3Ffbclid%3DIwAR00eT4EHsWC9986GUSox_7JS7IIg2wAan-tB-NteYJd8I4xckmxnfaNGEI&h=AT0cs4gTKPZlkSElC2uhoDYR98lsONooq_ZUFIK87khBmtZE_3r8j25EfioBPAdp-O8o7efRVG9uB-doy9vLT-AccZMrxnfpEiSYRmA2LTL21IU15bP_PTVw4SSibS1A_uE8bU-ROJexKgdk68VSTtE&__tn__=H-R&c[0]=AT3BNcTNFE13IJu3naJmxTRdJTWtO4O4L0_-nimmzcXpYv3N536YRpQZLg-v2FtP_Oz2DZZpBN6XQPb89JNJTsYFXlK8-1g4xdDLi1T_lfowpI5Ooh8kuLpciLiQ9t-ZmMd2CTUWaGZ_Y_JU0OEvVWfLLfjDq4VOzUtETBcvXHw2ZvQnTQ"
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])
It should get me to https://www.washingtonpost.com/science/2020/04/29/coronavirus-detection-dogs/?fbclid=IwAR00eT4EHsWC9986GUSox_7JS7IIg2wAan-tB-NteYJd8I4xckmxnfaNGEI. But I get the same URL I put in.
(By the way, if anyone happens to know how to do this in Javascript, that would be awesome, but Google tells me that's not possible.)
Since you just want to get the URL, requests can't be of much help here. Instead, you can use urllib.parse:
import urllib.parse as url_parse
url = "https://l.facebook.com/l.php?u=https..."
news_link = url_parse.unquote(url).split("?u=")[1]
# if you also want to drop the Facebook click ID, cut everything from "?fbclid" onward
news_link = url_parse.unquote(url).split("?u=")[1].split("?fbclid")[0]
print(news_link)
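Splitting on "?u=" keeps whatever follows the target in the original query string (such as the h parameter) unless it is cut off separately. A query-string parser may be more robust. This is a minimal sketch, assuming the target sits in a parameter named u as in the question; extract_target and strip_param are hypothetical helper names and the example link is made up.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit


def extract_target(link, param="u"):
    """Return the percent-decoded value of one query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(link).query).get(param)
    return values[0] if values else None


def strip_param(url, name="fbclid"):
    """Return url with every occurrence of one query parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


link = "https://l.facebook.com/l.php?u=https%3A%2F%2Fexample.com%2Fnews%2F%3Ffbclid%3Dabc&h=xyz"
target = extract_target(link)
print(target)
print(strip_param(target))
```

parse_qs already decodes the percent-encoding, so no separate unquote call is needed.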
