Resolve masked/shortened URL twint is scraping from twitter

Resolve masked/shortened URL twint is scraping from twitter - python

I am using twint for scraping twitter profiles.
When I run this script:
c = twint.Config()
c.Username = username
c.Store_object = True
c.Store_object_users_list = users
c.Hide_output = True
twint.run.Lookup(c)
try:
userna = users[0]
except:
continue
web = userna.url
I get the masked/shortened URL instead of a real one. How can I get the real url?
What would you advise?

Preface: This answer is based on my results and findings from evaluating UVuuMe's answer (in it's initial version).
To translate a shortened URL into the full URL it represents, you can use the package requests which comes indirectly with twint (it's required by googletransx which is required by twint which you already installed, so there is no need for pip install requests).
Send a HEAD request, then check the response's status_code for 303, only then read the location header; there are other cases where a location will be in the response, but not in case of HTTP 200 OK.
import requests
# Short URL for a Python Requests Tutorial
short_url = 'https://youtu.be/tb8gHvYlCFs'
res = requests.head(short_url)
if res.status_code == 303: # "See Other"
full_url = res.headers['location']
elif res.status_code == 200: # "OK"
# let's conclude that short_url is already what we are looking for
full_url = short_url
else:
# replace by your error handling:
assert(False)

The following works.
import requests
resp = requests.head(short_link)
resp.status_code
true_url = resp.headers["Location"]

Related

How do I get the URL of a redirect using Python [duplicate]

I've been looking through the Python Requests documentation but I cannot see any functionality for what I am trying to achieve.
In my script I am setting allow_redirects=True.
I would like to know if the page has been redirected to something else, what is the new URL.
For example, if the start URL was: www.google.com/redirect
And the final URL is www.google.co.uk/redirected
How do I get that URL?

You are looking for the request history.
The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.
response = requests.get(someurl)
if response.history:
print("Request was redirected")
for resp in response.history:
print(resp.status_code, resp.url)
print("Final destination:")
print(response.status_code, response.url)
else:
print("Request was not redirected")
Demo:
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
... print(resp.status_code, resp.url)
...
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get

This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.
If you want to use allow_redirects=False and get directly to the first redirect object, rather than following a chain of them, and you just want to get the redirect location directly out of the 302 response object, then r.url won't work. Instead, it's the "Location" header:
r = requests.get('http://github.com/', allow_redirects=False)
r.status_code # 302
r.url # http://github.com, not https.
r.headers['Location'] # https://github.com/ -- the redirect destination

I think requests.head instead of requests.get will be more safe to call when handling url redirect. Check a GitHub issue here:
r = requests.head(url, allow_redirects=True)
print(r.url)

the documentation has this blurb https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
import requests
r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for

For python3.5, you can use the following code:
import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)

I wrote the following function to get the full URL from a short URL (bit.ly, t.co, ...)
import requests
def expand_short_url(url):
r = requests.head(url, allow_redirects=False)
r.raise_for_status()
if 300 < r.status_code < 400:
url = r.headers.get('Location', url)
return url
Usage (short URL is this question's url):
short_url = 'https://tinyurl.com/' + '4d4ytpbx'
full_url = expand_short_url(short_url)
print(full_url)
Output:
https://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url

I wasn't able to use requests library and had to go different way. Here is the code that I post as solution to this post. (To get redirected URL with requests)
This way you actually open the browser, wait for your browser to log the url in the history log and then read last url in your history. I wrote this code for google chrom, but you should be able to follow along if you are using different browser.
import webbrowser
import sqlite3
import pandas as pd
import shutil
webbrowser.open("https://twitter.com/i/user/2274951674")
#source file is where the history of your webbroser is saved, I was using chrome, but it should be the same process if you are using different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not directly connect to history file as it was locked and had to make a copy of it in different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30) # there is some delay to update the history file, so 30 sec wait give it enough time to make sure your last url get logged
shutil.copy(source_file,destination_file) # copying the file.
con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')#connecting to browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls,columns=names)
last_url = df_history.loc[len(df_history)-1,'url']
print(last_url)
>>https://twitter.com/ozanbayram01

All the answers are applicable where the final url exists/working fine.
In case, final URL doesn't seems to work then below is way to capture all redirects.
There was scenario where final URL isn't working anymore and other ways like url history give error.
Code Snippet
long_url = ''
url = 'http://example.com/bla-bla'
try:
while True:
long_url = requests.head(url).headers['location']
print(long_url)
url = long_url
except:
print(long_url)

How to edit a facebook post in Python?

I have a post on my fb page which I need to update several times a day with data elaborated in a python script. I tried using Selenium, but it gets often stuck when saving the post hence the script gets stuck too, so I'm trying to find a way to do the job within python itself without using a web browser.
I wonder is there a way to edit a FB post using a python library such as Facepy or similar?
I'm reading the graph API reference but there are no examples to learn from, but I guess first thing is to set up the login. On the facepy github page is written that
note that Facepy does not do authentication with Facebook; it only consumes its API. To get an access token to consume the API on behalf of a user, use a suitable OAuth library for your platform
I tried logging in with BeautifulSoup
from bs4 import BeautifulSoup
import requests
import re
def facebook_login(mail, pwd):
session = requests.Session()
r = session.get('https://www.facebook.com/', allow_redirects=False)
soup = BeautifulSoup(r.text)
action_url = soup.find('form', id='login_form')['action']
inputs = soup.find('form', id='login_form').findAll('input', {'type': ['hidden', 'submit']})
post_data = {input.get('name'): input.get('value') for input in inputs}
post_data['email'] = mail
post_data['pass'] = pwd.upper()
scripts = soup.findAll('script')
scripts_string = '/n/'.join([script.text for script in scripts])
datr_search = re.search('\["_js_datr","([^"]*)"', scripts_string, re.DOTALL)
if datr_search:
datr = datr_search.group(1)
cookies = {'_js_datr' : datr}
else:
return False
return session.post(action_url, data=post_data, cookies=cookies, allow_redirects=False)
facebook_login('email', 'psw')
but it gives this error
action_url = soup.find('form', id='login_form')['action']
TypeError: 'NoneType' object is not subscriptable
I also tried with Mechanize
import mechanize
username = 'email'
password = 'psw'
url = 'http://facebook.com/login'
print("opening browser")
br = mechanize.Browser()
print("opening url...please wait")
br.open(url)
print(br.title())
print("selecting form")
br.select_form(name='Login')
br['UserID'] = username
br['PassPhrase'] = password
print("submitting form"
br.submit()
response = br.submit()
pageSource = response.read()
but it gives an error too
mechanize._response.httperror_seek_wrapper: HTTP Error 403: b'request disallowed by robots.txt'

Install the facebook package
pip install facebook-sdk
then to update/edit a post on your page just run
import facebook
page_token = '...'
page_id = '...'
post_id = '...'
fb = facebook.GraphAPI(access_token = page_token, version="2.12")
fb.put_object(parent_object=page_id+'_'+post_id, connection_name='', message='new text')

python get url from request

i get data from an api in django.
The data comes from an order form from another website.
The data also includes an url, for example like example.com but i can't validate the input because i don't have access to the order form.
The url that i get can also have different kinds. More examples:
example.de
http://example.de
www.example.com
https://example.de
http://www.example.de
https://www.example.de
Now i would like to open the url to get the correct url.
For example if i open example.com in my browser, i got the correct url http://example.com/ and that is what i wish for all urls.
How can i do that in python fast?

If you get status_code 200 you know that you have a valid address.
In regards to HTTPS://. You will get an SSL error if you don't Follow the answers in this guide. Once you have that in place, the program will find the correct URL for you.
import requests
import traceback
validProtocols = ["https://www.", "http://www.", "https://", "http://"]
def removeAnyProtocol(url):
url = url.replace("www.","") # to remove any inputs containing just www since we aren't planning on using them regardless.
for protocol in validProtocols:
url = url.replace(protocol, "")
return url
def validateUrl(url):
for protocol in validProtocols:
if(protocol not in url):
pUrl = protocol + removeAnyProtocol(url)
try:
req = requests.head(pUrl, allow_redirects=True)
if req.status_code == 200:
return pUrl
else:
continue
except Exception:
print(traceback.format_exc())
continue
else:
try:
req = requests.head(url, allow_redirects=True)
if req.status_code == 200:
return url
except Exception:
print(traceback.format_exc())
continue
Usage:
correctUrl = validateUrl("google.com") # https://www.google.com

Python - lxml - get current url address [duplicate]

I've been looking through the Python Requests documentation but I cannot see any functionality for what I am trying to achieve.
In my script I am setting allow_redirects=True.
I would like to know if the page has been redirected to something else, what is the new URL.
For example, if the start URL was: www.google.com/redirect
And the final URL is www.google.co.uk/redirected
How do I get that URL?

You are looking for the request history.
The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.
response = requests.get(someurl)
if response.history:
print("Request was redirected")
for resp in response.history:
print(resp.status_code, resp.url)
print("Final destination:")
print(response.status_code, response.url)
else:
print("Request was not redirected")
Demo:
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
... print(resp.status_code, resp.url)
...
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get

This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.
If you want to use allow_redirects=False and get directly to the first redirect object, rather than following a chain of them, and you just want to get the redirect location directly out of the 302 response object, then r.url won't work. Instead, it's the "Location" header:
r = requests.get('http://github.com/', allow_redirects=False)
r.status_code # 302
r.url # http://github.com, not https.
r.headers['Location'] # https://github.com/ -- the redirect destination

I think requests.head instead of requests.get will be more safe to call when handling url redirect. Check a GitHub issue here:
r = requests.head(url, allow_redirects=True)
print(r.url)

the documentation has this blurb https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
import requests
r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for

For python3.5, you can use the following code:
import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)

I wrote the following function to get the full URL from a short URL (bit.ly, t.co, ...)
import requests
def expand_short_url(url):
r = requests.head(url, allow_redirects=False)
r.raise_for_status()
if 300 < r.status_code < 400:
url = r.headers.get('Location', url)
return url
Usage (short URL is this question's url):
short_url = 'https://tinyurl.com/' + '4d4ytpbx'
full_url = expand_short_url(short_url)
print(full_url)
Output:
https://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url

I wasn't able to use requests library and had to go different way. Here is the code that I post as solution to this post. (To get redirected URL with requests)
This way you actually open the browser, wait for your browser to log the url in the history log and then read last url in your history. I wrote this code for google chrom, but you should be able to follow along if you are using different browser.
import webbrowser
import sqlite3
import pandas as pd
import shutil
webbrowser.open("https://twitter.com/i/user/2274951674")
#source file is where the history of your webbroser is saved, I was using chrome, but it should be the same process if you are using different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not directly connect to history file as it was locked and had to make a copy of it in different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30) # there is some delay to update the history file, so 30 sec wait give it enough time to make sure your last url get logged
shutil.copy(source_file,destination_file) # copying the file.
con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')#connecting to browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls,columns=names)
last_url = df_history.loc[len(df_history)-1,'url']
print(last_url)
>>https://twitter.com/ozanbayram01

All the answers are applicable where the final url exists/working fine.
In case, final URL doesn't seems to work then below is way to capture all redirects.
There was scenario where final URL isn't working anymore and other ways like url history give error.
Code Snippet
long_url = ''
url = 'http://example.com/bla-bla'
try:
while True:
long_url = requests.head(url).headers['location']
print(long_url)
url = long_url
except:
print(long_url)

Using urllib to get the final redirect of a webpage in Python 3.5

As in this post, I attempt to get the final redirect of a webpage as:
import urllib.request
response = urllib.request.urlopen(url)
response.geturl()
But this doesn't work as I get the "HTTPError: HTTP Error 300: Multiple Choices" error when attempting to use urlopen.
See documentation for these methods here.
EDIT:
This problem is different than the Python: urllib2.HTTPError: HTTP Error 300: Multiple Choices question, because they skip the error-causing pages, while I have to obtain the final destination.

As suggested by #abccd, I used the requests library. So I will describe the solution.
import requests
url_base = 'something' # You need this because the redirect URL is relative.
url = url_base + 'somethingelse'
response = requests.get(url)
# Check if the request returned with the 300 error code.
if response.status_code == 300:
redirect_url = url_base + response.headers['Location'] # Get new URL.
response = requests.get(redirect_url) # Make a new request.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Resolve masked/shortened URL twint is scraping from twitter - python

The following works. import requests resp = requests.head(short_link) resp.status_code true_url = resp.headers["Location"]

Related

How do I get the URL of a redirect using Python [duplicate]

How to edit a facebook post in Python?

python get url from request

Python - lxml - get current url address [duplicate]

Using urllib to get the final redirect of a webpage in Python 3.5

Categories

Resources