How to convert URL to subdomain using Python - python

So I have a list of URLs in a urls.txt file, containing URLs like the examples given below:
https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html
https://nikpeachey.blogspot.com/2020/01/digital-tools-for-teachers-trainers.html
https://blogurls245.blogspot.com/
Now I want to convert all URLs of that urls.txt to the subdomain, like the example given below:
https://benetech.blogspot.com
https://nikpeachey.blogspot.com
https://blogurls245.blogspot.com
I tried to do it using the TLD module, but being an extreme beginner in Python I couldn't figure it out!
It'd be great if someone could help me with this getting done via Python.

Use the urllib.parse module to parse the URL into its constituent parts and assemble it back together, omitting parts you're not interested in:
from urllib.parse import urlsplit, urlunsplit
url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
base = urlunsplit(urlsplit(url)[:2] + ('', '', ''))
print(base) # https://benetech.blogspot.com
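To apply the same idea to every line of urls.txt, a minimal sketch could look like the following. The helper names to_base_url, convert_lines and convert_file are made up for illustration, and the sketch assumes one URL per line with the scheme included.

```python
from urllib.parse import urlsplit, urlunsplit


def to_base_url(url):
    """Keep only the scheme and network location, e.g. https://host."""
    scheme, netloc = urlsplit(url.strip())[:2]
    return urlunsplit((scheme, netloc, '', '', ''))


def convert_lines(lines):
    """Return the base URL for each non-empty line (hypothetical helper)."""
    return [to_base_url(line) for line in lines if line.strip()]


def convert_file(path='urls.txt'):
    """Read one URL per line from path (assumed layout) and convert them all."""
    with open(path) as f:
        return convert_lines(f)
```

Calling convert_file('urls.txt') on the file from the question should then give the three blogspot base URLs, which can be printed or written to another file.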

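Using the urllib.parse module from the standard library:
import urllib.parse
url_parts = urllib.parse.urlparse(url)
# ParseResult is an immutable named tuple, so build a modified copy with _replace()
url_parts = url_parts._replace(path="", params="", query="", fragment="")
domain_only_url = urllib.parse.urlunparse(url_parts)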

from urllib.parse import urlparse
sample_url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
parsed_url = urlparse(sample_url)
subdomain = f'{parsed_url.scheme}://{parsed_url.hostname}'
print(subdomain)
Output:
https://benetech.blogspot.com

Do it like this:
url = 'https://benetech.blogspot.com/2019/02/robin-seaman-agent-of-inclusion.html'
parts = url.split('/')
subdomain = parts[0] + '//' + parts[2]
subdomain will be --> https://benetech.blogspot.com
split('/') splits the string into several parts at each /.
i.e. 'my/name/is/Amirreza' becomes ['my','name','is','Amirreza']

Related

Delete only a specific query from an URL

So I have the following URL: https://foo.bar?query1=value1&query2=value2&query3=value3
I'd need a function that can strip just query2 for example, so that the result would be:
https://foo.bar?query1=value1&query3=value3
I think maybe urllib.parse or furl can do this in an easy and clean way?
You should use urllib.parse, as it's designed exactly for this purpose. I don't see a reason for anyone to reinvent the wheel here.
Basically 3 steps:
Use urlparse to parse the URL into its component parts
Use parse_qs to parse the query string part of that, keeping blank values intact (if relevant)
Remove the unwanted query2 and re-encode the query string and URL back
From the docs:
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment. Each tuple item is a
string, possibly empty.
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
url_bits = list(urlparse(url))
print(url_bits)
query_string = parse_qs(url_bits[4], keep_blank_values=True)
print(query_string)
del(query_string['query2'])
url_bits[4] = urlencode(query_string, doseq=True)
new_url = urlunparse(url_bits)
print(new_url)
# >>>['https', 'foo.bar', '', '', 'query1=value1&query2=value2&query3=value3', '']
# >>>{'query1': ['value1'], 'query2': ['value2'], 'query3': ['value3']}
# >>>https://foo.bar?query1=value1&query3=value3
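The three steps above can also be wrapped in a small reusable function. This is only a sketch: the name remove_query_param is made up here, and it uses parse_qsl instead of parse_qs so that the remaining parameters keep their original order and repeated keys survive.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def remove_query_param(url, name):
    """Return url without any query parameter called name (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qsl yields (key, value) pairs in their original order
    pairs = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != name]
    # SplitResult is a named tuple, so _replace() returns a modified copy
    return urlunsplit(parts._replace(query=urlencode(pairs)))


url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
print(remove_query_param(url, "query2"))
# https://foo.bar?query1=value1&query3=value3
```

If the parameter is not present, the URL comes back with the same parameters, which makes the function safe to call unconditionally.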
If you want to remove it by position:
url = "https://foo.bar?query1=value1&query2=value2&query3=value3"
findindex1 = url.find("&")
findindex2 = url.find("&", findindex1 + 1)
url = url[0:findindex1] + url[findindex2:len(url)]
If you want to remove it by name:
url = "https://foo.bar?query1=value1&query3=value3&query2=value2"
findindex1 = url.find("query2")
findindex2 = url.find("&", findindex1 + 1)
# findindex1 - 1 also drops the "&" in front of query2;
# note this assumes query2 is not the first parameter when others follow it
if findindex2 == -1:
    url = url[0:findindex1 - 1]
else:
    url = url[0:findindex1 - 1] + url[findindex2:len(url)]
Hi, you could try it with regular expressions.
re.sub("ThePatternOfTheURL", "ThePatternYouWantToHave", "TheInput")
So it could look something like this:
import re
pattern = r"(https://)([a-zA-Z.?0-9=]+)(&query2=value2)(&[a-zA-Z0-9=]+)"
# keeps groups 1, 2 and 4, which filters out the third group containing query2
replacement = r"\1\2\4"
yourUrl = "https://foo.bar?query1=value1&query2=value2&query3=value3"
newURL = re.sub(pattern, replacement, yourUrl)
I think this should work for you.

Python Variable Mutation Best Practice

Let's say I'm passing a variable into a function, and I want to ensure it's properly formatted for my end use, with consideration for several potential unwanted formats.
Example: I want to store only lowercase representations of URL addresses, without http:// or https://.
def standardize(url):
    # Lowercase
    temp_url = url
    url = temp_url.lower()
    # Remove 'http://'
    if 'http://' in url:
        temp_url = url
        url = temp_url.replace('http://', '')
    # Remove 'https://'
    if 'https://' in url:
        temp_url = url
        url = temp_url.replace('https://', '')
    return url
I'm only just encroaching on the title of Novice, and was wondering if there is a more pythonic approach to achieving this type of process?
The end goal is the transformation of a URL like this: https://myurl.com/RANDoM --> myurl.com/random
The application to URL string formatting isn't of any particular importance.
A simple re.sub will do the trick:
import re
def standardize(url):
    return re.sub("^https?://", '', url.lower())
# with 'https'
print(standardize('https://myurl.com/RANDoM')) # prints 'myurl.com/random'
# with 'http'
print(standardize('http://myurl.com/RANDoM')) # prints 'myurl.com/random'
# both work
def standardize(url):
    return url.lower().replace("https://", "").replace("http://", "")
That's as simple as I can make it, but the chaining is a little ugly.
If you want to import regex, you could also do something like this:
import re
def standardize(url):
    return re.sub("^https?://", "", url.lower())
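One difference between the two approaches is worth seeing side by side: str.replace removes the scheme wherever it occurs in the string, while the anchored regex only touches the start. The sketch below uses two illustrative function names and a made-up URL with a second URL embedded in its query string.

```python
import re


def standardize_replace(url):
    # Removes every occurrence of the scheme, not just a leading one
    return url.lower().replace("https://", "").replace("http://", "")


def standardize_anchored(url):
    # The ^ anchor restricts the match to the beginning of the string
    return re.sub(r"^https?://", "", url.lower())


tricky = "https://myurl.com/?next=http://Other.com"
print(standardize_replace(tricky))   # myurl.com/?next=other.com
print(standardize_anchored(tricky))  # myurl.com/?next=http://other.com
```

For plain URLs such as https://myurl.com/RANDoM both versions give the same result, so the choice only matters if URLs can appear inside other URLs.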

How do i truncate url using python [duplicate]

This question already has answers here:
Get protocol + host name from URL
(16 answers)
Closed 9 years ago.
How do I truncate the URLs below right after the domain ("com") using Python, i.e. keep youtube.com only?
youtube.com/video/AiL6nL
yahoo.com/video/Hhj9B2
youtube.com/video/MpVHQ
google.com/video/PGuTN
youtube.com/video/VU34MI
Is it possible to truncate like this?
Check out Python's urlparse library. It is part of the standard library, so nothing else needs to be installed.
So you could do the following:
import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse
import re
def check_and_add_http(url):
    # checks if 'http://' or 'https://' is present at the start of the URL and adds 'http://' if not.
    http_regex = re.compile(r'^https?://')
    if http_regex.match(url):
        # 'http://' or 'https://' is present
        return url
    else:
        # add 'http://' for urlparse to work.
        return 'http://' + url
for url in url_list:  # url_list is your list of URL strings
    url = check_and_add_http(url)
    print(urlparse.urlsplit(url)[1])
You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
You can use split():
myUrl.split(r"/")[0]
to get "youtube.com"
and:
myUrl.split(r"/", 1)[1]
to get everything else
I'd use the function urlsplit from the standard library:
from urlparse import urlsplit # python 2
from urllib.parse import urlsplit # python 3
myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'
No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
'youtube.com'
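On Python 3 the same trick of prepending // can be sketched as a small helper that works for inputs with or without a scheme. The name host_of and the scheme-detecting regex are illustrative assumptions, not part of the standard library.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as http:// or https://
_HAS_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def host_of(url):
    """Return the network location of url, even if the scheme is missing."""
    url = url.strip()
    if not _HAS_SCHEME.match(url) and not url.startswith("//"):
        # Without a leading // urlsplit would treat the host as part of the path
        url = "//" + url
    return urlsplit(url).netloc


print(host_of("youtube.com/video/AiL6nL"))
print(host_of("http://docs.python.org/2/library/urlparse.html"))
```

The returned netloc still includes a port if one is present; using the hostname attribute of the urlsplit result instead would drop it.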
Just a crazy alternative solution using tldextract:
>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
For your particular input, you could use str.partition() or str.split():
print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com
Note: urlparse module (that you could use in general to parse an url) doesn't work in this case:
import urlparse
urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
# query='', fragment='')
In general, it is safe to use a regex here if you know that all lines start with a hostname and otherwise each line contains a well-formed uri:
import re
print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))
Output
youtube.com
yahoo.com
youtube.com
google.com
youtube.com
Note: it doesn't remove the optional port part -- host:port.

Find http:// and or www. and strip from domain. leaving domain.com

I'm quite new to python. I'm trying to parse a file of URLs to leave only the domain name.
Some of the URLs in my log file begin with http:// and some begin with www. Some begin with both.
This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?
line = re.findall(r'(https?://\S+)', line)
Currently when I run the code only http:// is stripped. if I change the code to the following:
line = re.findall(r'(https?://www.\S+)', line)
Only domains starting with both are affected.
I need the code to be more conditional.
TIA
edit... here is my full code...
import re
import sys
from urlparse import urlparse
f = open(sys.argv[1], "r")
for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed = urlparse(line[0])
        print(parsed.hostname)
f.close()
I mistagged my original post as regex. It is indeed using urlparse.
It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).
from urllib.parse import urlsplit # Python 3
from urlparse import urlsplit # Python 2
import re
url = 'www.python.org'
# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid
if not re.match(r'http(s?)\:', url):
    url = 'http://' + url
# url is now 'http://www.python.org'
parsed = urlsplit(url)
# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since (strictly speaking) the path was not defined
host = parsed.netloc # www.python.org
# Removing www.
# This is a bad idea, because www.python.org could
# resolve to something different than python.org
if host.startswith('www.'):
    host = host[4:]
You can do without regexes here.
with open("file_path", "r") as f:
    lines = f.read()
lines = lines.replace("http://", "")
lines = lines.replace("www.", "") # May replace some false positives ('www.com')
urls = [url.split('/')[0] for url in lines.split()]
print('\n'.join(urls))
Example file input:
http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com
Output:
foo.com
foobar.com
bar.com
foobar.com
Edit:
There could be a tricky URL like foobarwww.com, and the above approach would strip the www. there as well. We would then have to fall back to regexes.
Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'www\.(?!com)', '', lines). Of course, every possible TLD would have to be listed in the negative lookahead.
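The false positive described above can also be avoided without listing TLDs, by parsing each entry first and removing www. only when it is the leading label of the host. This is a sketch for Python 3; the function name bare_domain is made up, and it assumes each input is a single URL with or without an http(s) scheme.

```python
from urllib.parse import urlsplit


def bare_domain(raw):
    """Return the host of raw without scheme, path or a leading 'www.'."""
    raw = raw.strip()
    if not raw.lower().startswith(("http://", "https://")):
        # urlsplit only recognises the host when a scheme (or //) is present
        raw = "http://" + raw
    host = urlsplit(raw).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


for entry in ["http://foo.com/index.html", "http://www.foobar.com",
              "www.bar.com/?q=res", "www.foobar.com", "foobarwww.com"]:
    print(bare_domain(entry))
```

Because the check is startswith on the parsed host, foobarwww.com is left untouched while www.foobar.com becomes foobar.com.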
I came across the same problem. This is a solution based on regular expressions:
>>> import re
>>> rec = re.compile(r"https?://(www\.)?")
>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'https://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://www.domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
Check out the urlparse library, which can do these things for you automatically.
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
You can use urlparse. Also, the solution should be generic to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:
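from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse
url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'
o = urlparse(url)
domain = o.hostname
temp = domain.rsplit('.')
# only handles hosts with exactly three labels, e.g. www.muneeb.org
if len(temp) == 3:
    domain = temp[1] + '.' + temp[2]
print(domain)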
I believe @Muneeb Ali is the nearest to the solution, but the problem appears with something like frontdomain.domain.co.uk.
I suppose:
domain = ""
for i in range(1, len(temp) - 1):
    domain += temp[i] + "."
domain += temp[-1]
Is there a nicer way to do this?
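One possible answer is to decide how many labels to keep based on the suffix rather than on the number of dots. The sketch below is only an illustration: MULTI_LABEL_SUFFIXES is a tiny hand-written set invented for this example, and registered_domain is a made-up name. A real solution would consult the Public Suffix List, for instance through the tldextract package mentioned in another answer on this page.

```python
# Illustrative only: a real list of multi-label suffixes is much longer
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def registered_domain(host):
    """Return the last two labels of host, or three for known two-label suffixes."""
    labels = host.lower().split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(registered_domain("frontdomain.domain.co.uk"))  # domain.co.uk
print(registered_domain("server1.domain.com"))        # domain.com
print(registered_domain("www.muneeb.org"))            # muneeb.org
```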

Python url construction: escape characters other than regular letters

I am using the Wikipedia API with the following API request:
http://en.wikipedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser='$cammer'&guiprop=groups|merged|unattached&format=json
The problem is that I am unable to escape the dollar sign and similar characters. I tried the following, but it didn't work:
r['guiprop'] = u'groups|merged|unattached'
r['guiuser'] = u'$cammer'
I found this on w3schools, but checking it for every single character would be painful. What would be the best way to escape these characters in the string? http://www.w3schools.com/tags/ref_urlencode.asp
You should take a look at using urlencode.
from urllib import urlencode  # Python 2; on Python 3: from urllib.parse import urlencode
base_url = "http://en.wikipedia.org/w/api.php?"
arguments = dict(action="query",
meta="globaluserinfo",
guiuser="$cammer",
guiprop="groups|merged|unattached",
format="json")
url = base_url + urlencode(arguments)
If you don't need to build a complete URL you can just use the quote function for a single string (urllib.quote in Python 2, urllib.parse.quote in Python 3):
>>> import urllib
>>> urllib.quote("$cammer")
'%24cammer'
So you end up with:
r['guiprop'] = urllib.quote(u'groups|merged|unattached')
r['guiuser'] = urllib.quote(u'$cammer')
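For reference, a Python 3 sketch of the same two approaches is shown below. In Python 3 both functions live in urllib.parse; the variable names are only illustrative.

```python
from urllib.parse import urlencode, quote

base_url = "http://en.wikipedia.org/w/api.php?"
arguments = {
    "action": "query",
    "meta": "globaluserinfo",
    "guiuser": "$cammer",
    "guiprop": "groups|merged|unattached",
    "format": "json",
}
# urlencode percent-encodes each value, so $ becomes %24 and | becomes %7C
url = base_url + urlencode(arguments)
print(url)

# quote encodes a single string
print(quote("$cammer"))  # %24cammer
```

If the request is sent through a library that encodes its parameters itself, passing already quoted values would encode them twice, so only one of the two layers should do the escaping.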
