Extracting unique URLs in Python

I would like to extract all the unique URL items in my list before moving on with a web scraping project. Although I have a huge list of URLs on my side, I will present a minimal scenario here to explain the main issue. Assume my list is like this:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"https://www.ox.ac.uk/research"
]
def ExtractUniqueUrls(urls):
    pass

ExtractUniqueUrls(url_list)
For this minimal scenario, I am expecting only two unique URLs: "https://www.ox.ac.uk" and "https://www.ox.ac.uk/research". Although the URL elements differ in details such as "http" vs. "https", a trailing "/" or not, and index.php vs. index.html, they all point to exactly the same web page. There might be other possibilities I have missed (please mention them if you spot any). Anyway, what is the proper and efficient way to handle this issue using Python 3?
I am not looking for a hard-coded solution that treats each case individually. For instance, I do not want to manually check whether the URL ends with "/" or not. Possibly there is a much better solution with another package such as urllib? For that reason, I looked at urllib.parse, but I could not come up with a proper solution so far.
Thanks
Edit: I added one more example at the end of my list to explain the issue better. Otherwise, you might assume that I am looking only for the root URL, but that is not the case at all.

Covering only the cases you've revealed:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"ox.ac.uk/research",
"ox.ac.uk/index.php?12"]
def url_strip_gen(source: list):
    replace_dict = {".php": "", ".html": "", "http://": "", "https://": ""}
    for url in source:
        for key, val in replace_dict.items():
            url = url.replace(key, val, 1)
        url = url.rstrip('/')
        yield url[4:] if url.startswith("www.") else url

print(set(url_strip_gen(url_list)))
{'ox.ac.uk/index?12', 'ox.ac.uk/index', 'ox.ac.uk/research', 'ox.ac.uk'}
This won't cover a case where the URL itself contains .html (e.g. www.htmlsomething). That can be compensated for with urlparse, which stores the host and path separately, as below:
>>> import pprint
>>> from urllib.parse import urlparse
>>> a = urlparse("http://ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='http', netloc='ox.ac.uk', path='/index.php', params='', query='12', fragment='')
However, without a scheme:
>>> a = urlparse("ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='', netloc='', path='ox.ac.uk/index.php', params='', query='12', fragment='')
The whole host ends up in the path attribute.
To compensate for this, we either need to strip the scheme and prepend one to every URL, or check whether each URL starts with a scheme and add one only when it's missing. The former is easier to implement.
replace_dict = {"http://": "", "https://": ""}
for url in source:
# Unify scheme to HTTP
for key, val in replace_dict.items():
url = url.replace(key, val, 1)
url = "http://" + (url[4:] if url.startswith("www.") else url)
parsed = urlparse(url)
With this you are guaranteed separate control over each section of your URL via urlparse. However, since you haven't specified which parts should count toward a URL being unique, I'll leave that task to you.
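For illustration, here is a minimal end-to-end sketch of that idea. It assumes (as the question implies) that scheme, a leading www., a trailing slash, and index.php/index.html should all be ignored when comparing URLs; the function name unique_urls is just for this sketch:
from urllib.parse import urlparse

def unique_urls(urls):
    seen = set()
    for url in urls:
        # strip any scheme, then prepend one so urlparse fills netloc
        for scheme in ("https://", "http://"):
            if url.startswith(scheme):
                url = url[len(scheme):]
                break
        parsed = urlparse("http://" + url)
        host = parsed.netloc.lower()
        host = host[4:] if host.startswith("www.") else host
        path = parsed.path.rstrip("/")
        # treat index pages as the root (an assumption taken from the question)
        if path in ("/index.php", "/index.html"):
            path = ""
        seen.add((host, path, parsed.query))
    return seen

print(unique_urls(url_list))
# e.g. {('ox.ac.uk', '', ''), ('ox.ac.uk', '/research', '')} for the question's list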

Here's a quick and dirty attempt:
def extract_unique_urls(url_list):
    unique_urls = []
    for url in url_list:
        # Removing the 'https://' etc. part
        if url.find('//') > -1:
            url = url.split('//')[1]
        # Removing the 'www.' part
        url = url.replace('www.', '')
        # Removing trailing '/'
        url = url.rstrip('/')
        # If not a root url then inspect the last part of the url
        if url.find('/') > -1:
            # Extracting the last part
            last_part = url.split('/')[-1]
            # Deciding whether to keep the last part (no if '.' in it)
            if last_part.find('.') > -1:
                # If not kept: removing the last part and getting rid of
                # the trailing '/'
                url = '/'.join(url.split('/')[:-1]).rstrip('/')
        # Append if not already in list
        if url not in unique_urls:
            unique_urls.append(url)
    # Sorting for the fun of it
    return sorted(unique_urls)
I'm sure it doesn't cover all possible cases, but you can extend it if needed. I'm also not sure whether you wanted to keep the 'http(s)://' parts. If so, just add them to the results.
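For reference, a quick usage check against the question's original url_list (tracing the logic above by hand gives this scheme-less output):
print(extract_unique_urls(url_list))
# ['ox.ac.uk', 'ox.ac.uk/research']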

Related

How to print only a specific link in Python

I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it always follows the same pattern, with the stream id represented as X:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use a regex for this problem.
Explanation: in the expression, .*? lazily matches everything up to the target, and \.m3u8\b requires a literal .m3u8 followed by a word boundary (note the escaped dot; an unescaped . would match any character).
For example:
import re

link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
There are a few ways to go about this. One that springs to mind, which others have touched upon, is using a regex with findall, which returns a list of the matched URLs from our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here (a sketch follows the regex example below).
Using Regex
from re import findall
from requests import get

def check_link(response):
    result = findall(
        r'.*?\.m3u8\b',
        str(response.content),
    )
    return result

def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
                ),
            )

if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8" then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings where each item is a different part of the string (the first part being "https:", etc.). The last of these (index [-1]) will be the filename you want.
This will extract all URLs from the webpage and filter only those containing your required keyword ".m3u8":
import requests
import re

def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls

channel1 = requests.get('https://website.tv/user/111111')
urls = get_desired_url(channel1.text)
Try this; I think it will be robust:
import re

links = [re.sub(r'^<[ ]*a[ ]+.*href[ ]*=[ ]*', '', re.sub(r'.*>$', '', link))
         for link in re.findall(r'<[ ]*a[ ]+.*href[ ]*=[ ]*"http[s]*://.+\.m3u8".*>',
                                channel2.text)]

Python Variable Mutation Best Practice

Let's say I'm passing a variable into a function and I want to ensure it's properly formatted for my end use, accounting for several potential unwanted formats.
Example: I want to store only lowercase representations of URL addresses, without http:// or https://.
def standardize(url):
    # Lowercase
    temp_url = url
    url = temp_url.lower()
    # Remove 'http://'
    if 'http://' in url:
        temp_url = url
        url = temp_url.replace('http://', '')
    if 'https://' in url:
        temp_url = url
        url = temp_url.replace('https://', '')
    return url
I'm only just encroaching on the title of Novice, and was wondering if there is a more Pythonic approach to this type of process?
The end goal is the transformation of a URL like https://myurl.com/RANDoM --> myurl.com/random
The particular application of URL string formatting isn't of any importance.
A simple re.sub will do the trick:
import re

def standardize(url):
    return re.sub("^https?://", '', url.lower())

# with 'https'
print(standardize('https://myurl.com/RANDoM'))  # prints 'myurl.com/random'
# with 'http'
print(standardize('http://myurl.com/RANDoM'))   # prints 'myurl.com/random'
# both work
def standardize(url):
    return url.lower().replace("https://", "").replace("http://", "")
That's as simple as I can make it, but the chaining is a little ugly.
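If you're on Python 3.9+, a sketch using str.removeprefix avoids the chaining while staying explicit:
def standardize(url):
    url = url.lower()
    for prefix in ('https://', 'http://'):
        url = url.removeprefix(prefix)  # no-op if the prefix isn't there
    return url

print(standardize('https://myurl.com/RANDoM'))  # myurl.com/random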
If you're willing to import re, you could also do something like this:
import re

def standardize(url):
    return re.sub("^https?://", "", url.lower())

How can I prepend http to a url if it doesn't begin with http?

I have urls formatted as:
google.com
www.google.com
http://google.com
http://www.google.com
I would like to convert all types of links to a uniform format, starting with http://:
http://google.com
How can I prepend URLs with http:// using Python?
Python does have built-in functions to handle that correctly, like:
p = urlparse.urlparse(my_url, 'http')
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
if not netloc.startswith('www.'):
    netloc = 'www.' + netloc

p = urlparse.ParseResult('http', netloc, path, *p[3:])
print(p.geturl())
If you want to remove (or add) the www part, you have to edit the .netloc field of the resulting object before calling .geturl().
Because ParseResult is a namedtuple, you cannot edit it in-place, but have to create a new object.
PS:
For Python3, it should be urllib.parse.urlparse
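For instance, a minimal sketch of that edit using the namedtuple's _replace method (shown with the Python 3 import):
from urllib.parse import urlparse

p = urlparse('google.com', 'http')
if not p.netloc:  # scheme-less input puts the host into .path
    p = p._replace(netloc=p.path, path='')
print(p.geturl())  # http://google.com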
I found it easy to detect the protocol with a regex and prepend it if missing:
import re

def formaturl(url):
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url

url = 'test.com'
print(formaturl(url))  # http://test.com

url = 'https://test.com'
print(formaturl(url))  # https://test.com
I hope it helps!
For the formats that you mention in your question, you can do something as simple as:
def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url
But please note that there are probably other formats that you are not anticipating. In addition, keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (i.e., the DNS will not be able to translate it into a valid IP address).
If your URLs are strings, you could just concatenate:
one = "https://"
two = "www.privateproperty.co.za"
link = "".join((one, two))
from urllib.parse import urlsplit, urlunsplit, SplitResult

def fix_url(orig_link):
    # force scheme
    split_comps = urlsplit(orig_link, scheme='https')
    # fix netloc (can happen when there is no scheme)
    if not len(split_comps.netloc):
        if len(split_comps.path):
            # override components with fixed netloc and path
            split_comps = SplitResult(scheme='https', netloc=split_comps.path,
                                      path='', query=split_comps.query,
                                      fragment=split_comps.fragment)
        else:  # no netloc, no path
            raise ValueError
    return urlunsplit(split_comps)
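A few example calls, assuming the imports above (note this variant prefers https over http for scheme-less input):
print(fix_url('google.com'))         # https://google.com
print(fix_url('www.google.com'))     # https://www.google.com
print(fix_url('http://google.com'))  # http://google.com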

Fetch a particular part of the url in python

I am using Python and trying to fetch a particular part of the URL, as below:
from urlparse import urlparse as ue
url = "https://www.google.co.in"
img_url = ue(url).hostname
Result
www.google.co.in
case 1:
Actually I will have a number of URLs (stored in a list or somewhere else), so what I want is to find the domain name in each URL and fetch the part after www. and before .co.in, i.e. the string that starts after the first dot and ends before the second dot, which yields only google in the present scenario.
So suppose the URL given is www.gmail.com; I should fetch only gmail. Whatever the URL given, the code should fetch the part between the first dot and the second dot.
case 2:
Some URLs may also be given directly like domain.com or stackoverflow.com, without www; in those cases it should fetch only domain and stackoverflow.
Finally, my intention is to fetch the main name from the URL: gmail, stackoverflow, google, and so on.
Generally, if I have one URL I can use list slicing to fetch the string, but I will have a number of URLs, so I need to fetch the wanted part dynamically, as described above.
Can anyone please let me know how to achieve this?
Why can't you just do this:
from urlparse import urlparse as ue

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your parsed list of hostnames
Also, you might want to change the if statement in the for loop, because some domains might start with other things that you would want to get rid of.
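For example, you could strip a small set of common host prefixes instead of only 'www.' (the exact prefix list below is just an assumption):
def strip_prefix(hostname, prefixes=('www.', 'www2.', 'm.')):
    for prefix in prefixes:
        if hostname.startswith(prefix):
            return hostname[len(prefix):]
    return hostname

print(strip_prefix('m.google.com'))  # google.com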
What about using a set of predefined top-level domains?
import re
from urlparse import urlparse

# Fake top level domains... EG: co.uk, co.in, co.cc
TOPLEVEL = [".co.[a-zA-Z]+", ".fake.[a-zA-Z]+"]

def TLD(rgx, host, max=4):  # 4 = co.name
    match = re.findall("(%s)" % rgx, host, re.IGNORECASE)
    if match:
        if len(match[0].split(".")[1]) <= max:
            return match[0]
    else:
        return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    o = urlparse(url)
    h = o.hostname
    for j in range(len(TOPLEVEL)):
        TL = TLD(TOPLEVEL[j], h)
        if TL:
            name = h.replace(TL, "").split(".")[-1]
            parsed.append(name)
            break
        elif j + 1 == len(TOPLEVEL):
            parsed.append(h.split(".")[-2])
            break

print parsed
It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
Here is my solution; at the end, domains holds the list of domains you expected.
import urlparse

urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
]

hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
domains = [p[0] == 'www' and p[1] or p[0] for p in hostparts]
print domains  # ==> ['google', 'stackoverflow', 'google', 'domain']
Discussion
First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:
[ 'www.google.com', 'stackoverflow.com', ... ]
In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:
[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]
The interesting work is in the next line. This line says, "if the first part before the dot is www, then the domain is the second part (p[1]); otherwise, the domain is the first part (p[0])." The domains list looks like this:
[ 'google', 'stackoverflow', 'google', 'domain' ]
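As an aside, the and/or trick in that line predates conditional expressions; given hostparts from above, the modern equivalent is:
# conditional-expression equivalent of the and/or idiom above
domains = [p[1] if p[0] == 'www' else p[0] for p in hostparts]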
My code does not know how to handle login.gmail.com.hk. I hope someone else can solve this problem as I am late for bed. Update: Take a look at the tldextract by John Kurkowski, which should do what you want.
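A quick illustration of tldextract (pip install tldextract); the attribute names come from its ExtractResult:
import tldextract

ext = tldextract.extract('http://login.gmail.com.hk')
print(ext.subdomain, ext.domain, ext.suffix)  # login gmail com.hk

ext = tldextract.extract('https://www.google.co.in')
print(ext.domain)  # google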

Partial matches in a dictionary

Assume I have the following dictionary mapping domain names to their human-readable descriptions:
domain_info = {"google.com" : "A Search Engine",
"facebook.com" : "A Social Networking Site",
"stackoverflow.com" : "Q&A Site for Programmers"}
I would like to get the description from response.url which returns an absolute path http://www.google.com/reader/view/
My current approach
url = urlparse.urlparse(response.url)
domain = url.netloc # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
domain = domain[-2:] # ['google', 'com']
domain = ".".join(domain) # 'google.com'
info = domain_info[domain]
seems to be too slow for a large number of invocations. Can anyone suggest an alternative way to speed things up?
An ideal solution would handle any subdomain and be case-insensitive
What does "too slow for large number of operations" mean? It's still going to work in constant time (for each URL) and you can't get any better than that. The above seems to be a perfectly good way to do it.
If you need it to be a bit faster (but it wouldn't be terribly faster), you could write your own regex. Something like "[a-zA-Z]+://([a-zA-Z0-9.]+)". That would get the full domain, not the subdomain. You would still need to do the domain splitting unless you can use lookahead in the regex to get just the last two segments. Be sure to use re.compile to make the regex itself fast.
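A small sketch of that compiled-regex idea, using the pattern suggested above (the domain splitting afterwards is still up to you):
import re

HOST_RE = re.compile(r'[a-zA-Z]+://([a-zA-Z0-9.]+)')

def hostname(url):
    m = HOST_RE.match(url)
    return m.group(1) if m else None

print(hostname('http://www.google.com/reader/view/'))  # www.google.com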
Note that going domain[-2] is likely not going to be what you want. The logic of finding an appropriate "company level domain" is pretty complicated. For example, if the domain is google.com.au, this will give you "com.au" which is unlikely to be what you want -- you probably want "google.com.au".
As you say an ideal solution would handle any subdomain, you probably want to iterate over all the splits.
url = urlparse.urlparse(response.url)
domain = url.netloc          # 'www.google.com'
domain = domain.split(".")   # ['www', 'google', 'com']
info = None
for i in range(len(domain)):
    subdomain = ".".join(domain[i:])  # 'www.google.com', 'google.com', 'com'
    try:
        info = domain_info[subdomain]
        break
    except KeyError:
        pass
With the above code, you will find it if it matches any subdomain. As for case sensitivity, that is easy. Ensure all the keys in the dictionary are lowercase, and apply .lower() to the domain before all the other processing.
It seems like the urlparse.py in the Python 2.6 standard library does a bunch of things when calling the urlparse() function. It may be possible to speed things up by writing a little URL parser which does only what is absolutely necessary and no more.
UPDATE: see this part of Wikipedia's page about DNS for information on the syntax of domain names, it may give some ideas for the parser.
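A minimal scheme-and-host splitter in that spirit might look like this sketch; it handles far less than urlparse does (no credentials handling, ports left attached, and so on):
def fast_netloc(url):
    # drop the scheme, if any
    head, sep, rest = url.partition('://')
    rest = rest if sep else url
    # the netloc ends at the first '/', '?' or '#'
    for stop in ('/', '?', '#'):
        rest = rest.split(stop, 1)[0]
    return rest

print(fast_netloc('http://www.google.com/reader/view/'))  # www.google.com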
You may consider extracting the domain without sub-domains using a regular expression:
'http:\/\/([^\.]+\.)*([^\.][a-zA-Z0-9\-]+\.[a-zA-Z]{2,6})(\/?|\/.*)'
import re

m = re.search(r'http:\/\/([^\.]+\.)*([^\.][a-zA-Z0-9\-]+\.[a-zA-Z]{2,6})(\/?|\/.*)',
              'http://www.google.com/asd?#a')
print m.group(2)
You can use some of the work that urlparse does. Try to look things up directly by the netloc it returns and only fall back on the split/join if you must:
def normalize(domain):
    domain = domain.split(".")   # ['www', 'google', 'com']
    domain = domain[-2:]         # ['google', 'com']
    return ".".join(domain)      # 'google.com'

# caches the netlocs that are not "normal"
aliases = {}

def getinfo(url):
    netloc = urlparse.urlparse(url).netloc
    if netloc in aliases:
        return domain_info[aliases[netloc]]
    if netloc in domain_info:
        return domain_info[netloc]
    main = normalize(netloc)
    if main in domain_info:
        aliases[netloc] = main
        return domain_info[main]  # look up by the normalized key
Same thing with a caching lib:
from beaker.cache import CacheManager

netlocs = CacheManager(namespace='netloc')

@netlocs.cache()
def getloc(domain):
    try:
        return domain_info[domain]
    except KeyError:
        domain = domain.split(".")
        domain = domain[-2:]
        domain = ".".join(domain)
        return domain_info[domain]

def getinfo(url):
    netloc = urlparse.urlparse(url).netloc
    return getloc(netloc)
Maybe it helps a bit, but it really depends on the variety of urls you have.
