Let's say I'm passing a variable into a function, and I want to ensure it's properly formatted for my end use, with consideration for several potential unwanted formats.
Example: I want to store only lowercase representations of URLs, without http:// or https://.
def standardize(url):
    # Lowercase
    temp_url = url
    url = temp_url.lower()
    # Remove 'http://'
    if 'http://' in url:
        temp_url = url
        url = temp_url.replace('http://', '')
    if 'https://' in url:
        temp_url = url
        url = temp_url.replace('https://', '')
    return url
I'm only just encroaching on the title of Novice, and was wondering if there is a more Pythonic approach to this type of process.
The end goal is a transformation of a URL such as https://myurl.com/RANDoM --> myurl.com/random
The application to URL string formatting isn't of any particular importance.
A simple re.sub will do the trick:
import re
def standardize(url):
    return re.sub("^https?://",'',url.lower())
# with 'https'
print(standardize('https://myurl.com/RANDoM')) # prints 'myurl.com/random'
# with 'http'
print(standardize('http://myurl.com/RANDoM')) # prints 'myurl.com/random'
# both work
def standardize(url):
    return url.lower().replace("https://","").replace("http://","")
That's as simple as I can make it, but the chaining is a little ugly.
If you want to import regex, you could also do something like this:
import re
def standardize(url):
    return re.sub("^https?://", "", url.lower())
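The two variants above differ in one detail: the chained replace removes http:// or https:// wherever it occurs in the string, while the anchored regex only touches the start. A minimal sketch of a prefix-only version without regex might look like the following; the name standardize_prefix and the prefixes parameter are illustrative and do not come from the answers above.

```python
def standardize_prefix(url, prefixes=("https://", "http://")):
    """Lowercase the URL and strip one leading scheme prefix, if present."""
    url = url.lower()
    for prefix in prefixes:
        if url.startswith(prefix):
            # Slice off the prefix only at the start of the string
            return url[len(prefix):]
    return url


print(standardize_prefix('https://myurl.com/RANDoM'))  # prints 'myurl.com/random'
```

On Python 3.9 and later, str.removeprefix could replace the startswith check and slice.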
Related
I would like to extract the unique URLs from my list in order to move on with a web scraping project. Although my actual list of URLs is huge, here is a minimal scenario that shows the main issue. Assume that my list looks like this:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"https://www.ox.ac.uk/research"
]
def ExtractUniqueUrls(urls):
    pass
ExtractUniqueUrls(url_list)
For the minimal scenario, I expect only two unique URLs: "https://www.ox.ac.uk" and "https://www.ox.ac.uk/research". Although the URL elements differ in some ways (http vs. https, with or without a trailing "/", index.php, index.html), they all point to exactly the same web page. There might be other variations that I have missed (please mention them if you catch any). Anyway, what is the proper and efficient way to handle this in Python 3?
I am not looking for a hard-coded solution that handles each case individually. For instance, I do not want to manually check whether the URL has a "/" at the end or not. Possibly there is a much better solution with other packages such as urllib? For that reason I looked at urllib.parse, but I could not come up with a proper solution so far.
Thanks
Edit: I added one more example at the end of my list to explain this better. Otherwise, you might assume that I am looking for the root URL, but that is not the case at all.
By only following the cases you've revealed:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"ox.ac.uk/research",
"ox.ac.uk/index.php?12"]
def url_strip_gen(source: list):
    replace_dict = {".php": "", ".html": "", "http://": "", "https://": ""}
    for url in source:
        for key, val in replace_dict.items():
            url = url.replace(key, val, 1)
        url = url.rstrip('/')
        yield url[4:] if url.startswith("www.") else url
print(set(url_strip_gen(url_list)))
{'ox.ac.uk/index?12', 'ox.ac.uk/index', 'ox.ac.uk/research', 'ox.ac.uk'}
This won't cover the case where the URL itself contains .html, such as www.htmlsomething. That case can be handled with urlparse, which stores the host and the path separately, as shown below:
>>> import pprint
>>> from urllib.parse import urlparse
>>> a = urlparse("http://ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='http', netloc='ox.ac.uk', path='/index.php', params='', query='12', fragment='')
However, without a scheme:
>>> a = urlparse("ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='', netloc='', path='ox.ac.uk/index.php', params='', query='12', fragment='')
The whole host ends up in the path attribute.
To compensate for this, we either need to strip the scheme and add one to every URL, or check whether the URL starts with a scheme and add one if it doesn't. The former is easier to implement.
replace_dict = {"http://": "", "https://": ""}
for url in source:
    # Unify scheme to HTTP
    for key, val in replace_dict.items():
        url = url.replace(key, val, 1)
    url = "http://" + (url[4:] if url.startswith("www.") else url)
    parsed = urlparse(url)
With this you are guaranteed separate control over each section of your URL via urlparse. However, as you did not specify which components should be considered for a URL to be unique enough, I'll leave that task to you.
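One possible way to finish that task is sketched below. It assumes that two URLs count as the same page when host (without www.), path (without a trailing index.php or index.html and without a trailing slash), and query string all agree. The helper names canonical_key and unique_urls, and the INDEX_NAMES tuple, are made up for this sketch; the uniqueness rule itself is an assumption, not something the question pins down.

```python
from urllib.parse import urlparse

# Assumption: these file names are treated as aliases of their directory.
INDEX_NAMES = ("index.php", "index.html")


def canonical_key(url):
    """Return a (host, path, query) tuple used to compare URLs."""
    # Drop any existing scheme, then add one so urlparse fills in netloc.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    parsed = urlparse("http://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[:-len(name)]
            break
    return host, path.rstrip("/"), parsed.query


def unique_urls(urls):
    """Keep the first URL seen for each canonical key, in input order."""
    seen = {}
    for url in urls:
        seen.setdefault(canonical_key(url), url)
    return list(seen.values())


urls = ["https://www.ox.ac.uk/", "http://www.ox.ac.uk/index.php",
        "ox.ac.uk", "https://www.ox.ac.uk/research"]
print(unique_urls(urls))
# ['https://www.ox.ac.uk/', 'https://www.ox.ac.uk/research']
```

Keeping the query string in the key means ox.ac.uk/index.php?12 stays distinct from ox.ac.uk; drop it from the tuple if that is not wanted.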
Here's a quick and dirty attempt:
def extract_unique_urls(url_list):
    unique_urls = []
    for url in url_list:
        # Removing the 'https://' etc. part
        if url.find('//') > -1:
            url = url.split('//')[1]
        # Removing the 'www.' part
        url = url.replace('www.', '')
        # Removing trailing '/'
        url = url.rstrip('/')
        # If not root url then inspect the last part of the url
        if url.find('/') > -1:
            # Extracting the last part
            last_part = url.split('/')[-1]
            # Deciding if to keep the last part (no if '.' in it)
            if last_part.find('.') > -1:
                # If no to keep: Removing last part and getting rid of
                # trailing '/'
                url = '/'.join(url.split('/')[:-1]).rstrip('/')
        # Append if not already in list
        if url not in unique_urls:
            unique_urls.append(url)
    # Sorting for the fun of it
    return sorted(unique_urls)
I'm sure it doesn't cover all possible cases, but you can probably extend it where it falls short. I'm also not sure if you wanted to keep the 'http(s)://' parts. If yes, then just add them to the results.
I'm still a newbie in Python but I'm trying to make my first little program.
My intention is to print only the link ending with .m3u8 (if available) instead of printing the whole web page.
The code I'm currently using:
import requests
channel1 = requests.get('https://website.tv/user/111111')
print(channel1.content)
print('\n')
channel2 = requests.get('https://website.tv/user/222222')
print(channel2.content)
print('\n')
input('Press Enter to Exit...')
The link I'm looking for always has 47 characters in total, and it's always the same model just changing the stream id represented as X:
https://website.tv/live/streamidXXXXXXXXX.m3u8
Can anyone help me?
You can use regex for this problem.
Explanation:
In the pattern, .*? lazily matches any run of characters, \.m3u8 matches the literal extension (the dot is escaped so that it cannot match an arbitrary character), and the trailing \b is a word boundary, so the match must end right after m3u8.
For example:
import re
link = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
p = re.findall(r'.*?\.m3u8\b', link)
print(p)
OUTPUT:
['https://website.tv/live/streamidXXXXXXXXX.m3u8']
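When the input is a whole page rather than a single link, a leading .*? would also pull in whatever text precedes the URL on the same line. A hedged sketch that anchors the match on the scheme instead is shown below; the function name find_m3u8_links and the sample HTML string are invented for illustration, and the stream id in the sample is a placeholder.

```python
import re

# Match from the scheme up to the first ".m3u8", stopping at whitespace,
# quotes and angle brackets so surrounding markup is not captured.
M3U8_RE = re.compile(r'https?://[^\s"\'<>]+?\.m3u8\b')


def find_m3u8_links(text):
    """Return every http(s) URL in text that ends with .m3u8."""
    return M3U8_RE.findall(text)


sample = ('<video src="https://website.tv/live/streamid123456789.m3u8"></video> '
          '<a href="https://website.tv/about">about</a>')
print(find_m3u8_links(sample))
# ['https://website.tv/live/streamid123456789.m3u8']
```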
There are a few ways to go about this. One that springs to mind, which others have touched upon, is using regex with findall, which returns a list of the matched URLs for each URL in our url_list.
Another option could be BeautifulSoup, but without more information about the HTML structure it may not be the best tool here.
Using Regex
from re import findall
from requests import get
def check_link(response):
    # Search the decoded page text; the dot before m3u8 is escaped
    result = findall(
        r'.*?\.m3u8\b',
        response.text,
    )
    return result
def main(url):
    response = get(url)
    if response.ok:
        link_found = check_link(response)
        if link_found:
            print('link {} found at {}'.format(
                link_found,
                url,
                ),
            )
if __name__ == '__main__':
    url_list = [
        'http://www.test_1.com',
        'http://www.test_2.com',
        'http://www.test_3.com',
    ]
    for url in url_list:
        main(url)
    print("All finished")
If I understand your question correctly I think you want to use Python's .split() string method. If your goal is to take a string like "https://website.tv/live/streamidXXXXXXXXX.m3u8" and extract just "streamidXXXXXXXXX.m3u8" then you could do that with the following code:
web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8"
specific_file = web_address.split('/')[-1]
print(specific_file)
Calling .split('/') on the string like that returns a list of strings where each item is a different part of the original string (the first part being "https:", etc.). The last of these (index [-1]) is the file name you want.
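If the address can carry a query string or fragment, splitting on '/' alone would leave that suffix attached to the file name. A small sketch that parses the URL first is given below; the helper name file_name_from_url and the ?token=abc query are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)


web_address = "https://website.tv/live/streamidXXXXXXXXX.m3u8?token=abc"
print(file_name_from_url(web_address))  # streamidXXXXXXXXX.m3u8
```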
This will extract all URLs from the webpage and keep only those that contain your required keyword ".m3u8":
import requests
import re
def get_desired_url(data):
    urls = []
    for url in re.findall(r'(https?://\S+)', data):
        if ".m3u8" in url:
            urls.append(url)
    return urls
channel1 = requests.get('https://website.tv/user/111111')
# Pass the decoded page text, not the Response object, to the regex
urls = get_desired_url(channel1.text)
Try this, I think this will be robust
import re
links = re.findall(r'<\s*a\s+[^>]*href\s*=\s*"(https?://[^"]+\.m3u8)"', channel2.text)
I've created a script in Python that uses a regular expression to parse emails from a few websites. The pattern that I've used to grab emails is \w+#\w+\.{1}\w+, which works in most cases. However, trouble comes up when it encounters items like 8b4e078a51d04e0e9efdf470027f0ec1#sentry.wixpress, Slice_1#2x.png, etc. The pattern grabs them as well, which I would like to avoid.
I've tried with:
import re
import requests
pattern = r'\w+#\w+\.{1}\w+'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link,pattern):
    res = requests.get(link)
    email = re.findall(pattern,res.text)
    if email:
        return link,email[0]
    else:
        return link
if __name__ == '__main__':
    for link in urls:
        print(get_email(link,pattern))
Output I'm getting:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc#gmail.com')
('https://www.auucvancouver.ca/', '8b4e078a51d04e0e9efdf470027f0ec1#sentry.wixpress')
('http://www.bcla.bc.ca/', 'Slice_1#2x.png')
('http://www.palstudiotheatre.com/', 'theatre#palvancouver.org')
Output I wish to get:
('https://rainforestfarms.org/contact', 'rainforestfarmsllc#gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre#palvancouver.org')
How can I get rid of unwanted items using regex?
It depends on what you mean by "unwanted".
One way to define them is to use a whitelist of allowed domain suffixes, for example 'org', 'com', etc.
import re
import requests
pattern = r'\w+#\w+\.(?:com|org)'
urls = (
'https://rainforestfarms.org/contact',
'https://www.auucvancouver.ca/',
'http://www.bcla.bc.ca/',
'http://www.palstudiotheatre.com/',
)
def get_email(link,pattern):
    res = requests.get(link)
    email = re.findall(pattern, res.text)
    if email:
        return link, email[0]
    else:
        return link
for link in urls:
    print(get_email(link,pattern))
yields
('https://rainforestfarms.org/contact', 'rainforestfarmsllc#gmail.com')
https://www.auucvancouver.ca/
http://www.bcla.bc.ca/
('http://www.palstudiotheatre.com/', 'theatre#palvancouver.org')
You could obviously do more complex things such as blacklists or regex patterns for the suffix.
As always for this kind of question I strongly recommend using regex101 to check and understand your regex.
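The blacklist idea mentioned above could be sketched roughly as follows. The pattern is passed in exactly as it is written in the question; the function name first_wanted_match and the contents of BLOCKED_SUFFIXES are assumptions chosen to cover the two unwanted examples, not a vetted list.

```python
import re

# Hypothetical blacklist: suffixes that signal an asset or tracker id,
# not an email address.
BLOCKED_SUFFIXES = ("png", "jpg", "gif", "wixpress")


def first_wanted_match(text, pattern, blocked=BLOCKED_SUFFIXES):
    """Return the first regex match whose final suffix is not blacklisted."""
    for match in re.findall(pattern, text):
        suffix = match.rsplit(".", 1)[-1].lower()
        if suffix not in blocked:
            return match
    return None


pattern = r'\w+#\w+\.\w+'  # the question's pattern, with \.{1} shortened to \.
print(first_wanted_match('Slice_1#2x.png theatre#palvancouver.org', pattern))
# theatre#palvancouver.org
```

Unlike the whitelist, this keeps scanning past an unwanted first match instead of giving up on the page.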
I have urls formatted as:
google.com
www.google.com
http://google.com
http://www.google.com
I would like to convert all types of links to a uniform format, starting with http://
http://google.com
How can I prepend URLs with http:// using Python?
Python does have built-in functions to handle that correctly, like
import urlparse  # Python 2 module name; see the note below for Python 3
p = urlparse.urlparse(my_url, 'http')
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
if not netloc.startswith('www.'):
    netloc = 'www.' + netloc
p = urlparse.ParseResult('http', netloc, path, *p[3:])
print(p.geturl())
If you want to remove (or add) the www part, you have to edit the .netloc field of the resulting object before calling .geturl().
Because ParseResult is a namedtuple, you cannot edit it in-place, but have to create a new object.
PS:
For Python3, it should be urllib.parse.urlparse
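A Python 3 sketch of the same approach, adapted to strip the www. prefix so that all four formats from the question collapse to http://google.com, might look like this. The function name to_http is made up for the example, and the sketch only targets the bare-host formats listed in the question.

```python
from urllib.parse import ParseResult, urlparse


def to_http(url):
    """Normalize a bare or prefixed host to the form http://host."""
    p = urlparse(url, "http")
    # Without a scheme, urlparse puts the host into .path, not .netloc
    netloc = p.netloc or p.path
    path = p.path if p.netloc else ""
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    # ParseResult is a namedtuple, so build a new one instead of mutating
    return ParseResult("http", netloc, path, *p[3:]).geturl()


for url in ("google.com", "www.google.com",
            "http://google.com", "http://www.google.com"):
    print(to_http(url))  # http://google.com each time
```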
I found it easy to detect the protocol with regex and then append it if missing:
import re
def formaturl(url):
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url
url = 'test.com'
print(formaturl(url)) # http://test.com
url = 'https://test.com'
print(formaturl(url)) # https://test.com
I hope it helps!
For the formats that you mention in your question, you can do something as simple as:
def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url
But please note that there are probably other formats that you are not anticipating. In addition, keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (i.e., the DNS will not be able to translate it into a valid IP address).
If your URLs are strings, you could just concatenate.
one = "https://"
two = "www.privateproperty.co.za"
link = "".join((one, two))
from urllib.parse import SplitResult, urlsplit, urlunsplit  # Python 3
def fix_url(orig_link):
    # force scheme
    split_comps = urlsplit(orig_link, scheme='https')
    # fix netloc (can happen when there is no scheme)
    if not len(split_comps.netloc):
        if len(split_comps.path):
            # override components with fixed netloc and path
            split_comps = SplitResult(scheme='https',netloc=split_comps.path,path='',query=split_comps.query,fragment=split_comps.fragment)
        else: # no netloc, no path
            raise ValueError
    return urlunsplit(split_comps)
I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.
Some of the URLs in my log file begin with http:// and some begin with www. Some begin with both.
This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?
line = re.findall(r'(https?://\S+)', line)
Currently when I run the code only http:// is stripped. If I change the code to the following:
line = re.findall(r'(https?://www.\S+)', line)
Only domains starting with both are affected.
I need the code to be more conditional.
TIA
edit... here is my full code...
import re
import sys
from urlparse import urlparse
f = open(sys.argv[1], "r")
for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed=urlparse(line[0])
        print parsed.hostname
f.close()
I mistagged my original post as regex. It is indeed using urlparse.
It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).
try:
    from urllib.parse import urlsplit # Python 3
except ImportError:
    from urlparse import urlsplit # Python 2
import re
url = 'www.python.org'
# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid
if not re.match(r'http(s?)\:', url):
    url = 'http://' + url
# url is now 'http://www.python.org'
parsed = urlsplit(url)
# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (empty), since (strictly speaking) the path was not defined
host = parsed.netloc # www.python.org
# Removing www.
# This is a bad idea, because www.python.org could
# resolve to something different than python.org
if host.startswith('www.'):
    host = host[4:]
You can do without regexes here.
with open("file_path","r") as f:
    lines = f.read()
lines = lines.replace("http://","")
lines = lines.replace("www.", "") # May replace some false positives ('www.com')
urls = [url.split('/')[0] for url in lines.split()]
print '\n'.join(urls)
Example file input:
http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com
Output:
foo.com
foobar.com
bar.com
foobar.com
Edit:
There could be a tricky url like foobarwww.com, and the above approach would strip the www. We will have to then revert back to using regexes.
Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'(www.)(?!com)',r'',lines). Of course, every possible TLD should be used for the not-match pattern.
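An alternative that avoids listing TLDs is to strip www. only when it sits at the very start of the host, right after an optional scheme. The sketch below illustrates that idea under the assumption that each input is a single URL per string; the names PREFIX_RE and domain_only are made up for the example.

```python
import re

# Optional scheme followed by an optional "www.", anchored at the start.
PREFIX_RE = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)


def domain_only(url):
    """Strip a leading scheme and 'www.', then cut at the first slash."""
    return PREFIX_RE.sub('', url.strip(), count=1).split('/')[0]


for url in ("http://foo.com/index.html", "http://www.foobar.com",
            "www.bar.com/?q=res", "foobarwww.com"):
    print(domain_only(url))
# foo.com
# foobar.com
# bar.com
# foobarwww.com
```

Because the pattern is anchored with ^, a www. in the middle of a host name such as foobarwww.com is left alone.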
I came across the same problem. This is a solution based on regular expressions:
>>> import re
>>> rec = re.compile(r"https?://(www\.)?")
>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'https://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://www.domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
Check out the urlparse library, which can do these things for you automatically.
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
You can use urlparse. Also, the solution should be generic to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:
from urlparse import urlparse
url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'
o = urlparse(url)
domain = o.hostname
temp = domain.rsplit('.')
if(len(temp) == 3):
    domain = temp[1] + '.' + temp[2]
print domain
I believe @Muneeb Ali is the nearest to the solution, but a problem appears with something like frontdomain.domain.co.uk....
I suppose:
domain = ""
for i in range(1, len(temp) - 1):
    domain += temp[i] + "."
domain += temp[-1]
Is there a nicer way to do this?
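One possibly nicer shape for this, sketched under a loud assumption: the code needs to know which suffixes span two labels (co.uk and the like). The TWO_PART_SUFFIXES set below is a tiny hand-written sample, not a complete list, and registered_domain is a hypothetical helper name. A robust version would consult the Public Suffix List rather than a hard-coded set.

```python
# Assumption: a small, incomplete sample of two-label public suffixes.
TWO_PART_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def registered_domain(hostname):
    """Return the last two labels, or three when the suffix has two labels."""
    labels = hostname.lower().split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(registered_domain("frontdomain.domain.co.uk"))  # domain.co.uk
print(registered_domain("www.muneeb.org"))            # muneeb.org
```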