Suppose I was given a URL.
It might already have GET parameters (e.g. http://example.com/search?q=question) or it might not (e.g. http://example.com/).
And now I need to add some parameters to it like {'lang':'en','tag':'python'}. In the first case I'm going to have http://example.com/search?q=question&lang=en&tag=python and in the second — http://example.com/search?lang=en&tag=python.
Is there any standard way to do this?
There are a couple of quirks with the urllib and urlparse modules. Here's a working example:
try:
    import urlparse
    from urllib import urlencode
except ImportError:  # For Python 3
    import urllib.parse as urlparse
    from urllib.parse import urlencode
url = "http://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
query.update(params)
url_parts[4] = urlencode(query)
print(urlparse.urlunparse(url_parts))
ParseResult, the result of urlparse(), is read-only, so we need to convert it to a list before we can modify its data.
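Alternatively, since ParseResult is a namedtuple, its _replace() method returns a modified copy, which avoids the list juggling. A minimal sketch using the same imports and variables as above:

parts = urlparse.urlparse(url)
query = dict(urlparse.parse_qsl(parts.query))
query.update(params)
print(parts._replace(query=urlencode(query)).geturl())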
Outsource it to the battle-tested requests library.
This is how I would do it:
from requests.models import PreparedRequest
url = 'http://example.com/search?q=question'
params = {'lang':'en','tag':'python'}
req = PreparedRequest()
req.prepare_url(url, params)
print(req.url)
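Note that prepare_url appends the new params after any existing query string rather than replacing matching keys, so the example above should print something like:
http://example.com/search?q=question&lang=en&tag=python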
Why
I wasn't satisfied with all the solutions on this page (come on, where is our favorite copy-paste thing?), so I wrote my own based on the answers here. It tries to be complete and more Pythonic. I've added a handler for dict and bool values in arguments to be more consumer-side (JS) friendly, but they are optional; you can drop them.
How it works
Test 1: Adding new arguments, handling Arrays and Bool values:
url = 'http://stackoverflow.com/test'
new_params = {'answers': False, 'data': ['some','values']}
add_url_params(url, new_params) == \
    'http://stackoverflow.com/test?data=some&data=values&answers=false'
Test 2: Rewriting existing args, handling DICT values:
url = 'http://stackoverflow.com/test/?question=false'
new_params = {'question': {'__X__':'__Y__'}}
add_url_params(url, new_params) == \
    'http://stackoverflow.com/test/?question=%7B%22__X__%22%3A+%22__Y__%22%7D'
Talk is cheap. Show me the code.
The code itself. I've tried to describe it in detail:
from json import dumps

try:
    from urllib import urlencode, unquote
    from urlparse import urlparse, parse_qsl, ParseResult
except ImportError:
    # Python 3 fallback
    from urllib.parse import (
        urlencode, unquote, urlparse, parse_qsl, ParseResult
    )


def add_url_params(url, params):
    """ Add GET params to provided URL being aware of existing.

    :param url: string of target URL
    :param params: dict containing requested params to be added
    :return: string with updated URL

    >> url = 'http://stackoverflow.com/test?answers=true'
    >> new_params = {'answers': False, 'data': ['some','values']}
    >> add_url_params(url, new_params)
    'http://stackoverflow.com/test?data=some&data=values&answers=false'
    """
    # Unquoting URL first so we don't lose existing args
    url = unquote(url)
    # Extracting URL info
    parsed_url = urlparse(url)
    # Extracting URL arguments from parsed URL
    get_args = parsed_url.query
    # Converting URL arguments to dict
    parsed_get_args = dict(parse_qsl(get_args))
    # Merging URL arguments dict with new params
    parsed_get_args.update(params)

    # Bool and Dict values should be converted to json-friendly values
    # you may throw this part away if you don't like it :)
    parsed_get_args.update(
        {k: dumps(v) for k, v in parsed_get_args.items()
         if isinstance(v, (bool, dict))}
    )

    # Converting URL arguments to a proper query string
    encoded_get_args = urlencode(parsed_get_args, doseq=True)
    # Creating a new parsed result object based on the provided one,
    # with the new URL arguments. Same thing happens inside urlparse.
    new_url = ParseResult(
        parsed_url.scheme, parsed_url.netloc, parsed_url.path,
        parsed_url.params, encoded_get_args, parsed_url.fragment
    ).geturl()

    return new_url
Please be aware that there may be some issues; if you find one, please let me know and we will make this thing better.
You want to use URL encoding if the strings can have arbitrary data (for example, characters such as ampersands, slashes, etc. will need to be encoded).
Check out urllib.urlencode:
>>> import urllib
>>> urllib.urlencode({'lang':'en','tag':'python'})
'lang=en&tag=python'
In Python 3:
from urllib import parse
parse.urlencode({'lang':'en','tag':'python'})
You can also use the furl module (https://github.com/gruns/furl):
>>> from furl import furl
>>> print furl('http://example.com/search?q=question').add({'lang':'en','tag':'python'}).url
http://example.com/search?q=question&lang=en&tag=python
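furl also exposes the query as a dict-like args attribute, so you can update parameters in place. A minimal sketch (args is furl's query-parameter mapping):

>>> f = furl('http://example.com/search?q=question')
>>> f.args['lang'] = 'en'
>>> f.args['tag'] = 'python'
>>> f.url
'http://example.com/search?q=question&lang=en&tag=python'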
If you are using the requests lib:
import requests
...
params = {'tag': 'python'}
requests.get(url, params=params)
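requests merges params into whatever query string is already on the URL, and the final URL is available on the response object. A small sketch (example.com is a placeholder host):

import requests

resp = requests.get('http://example.com/search?q=question', params={'lang': 'en', 'tag': 'python'})
print(resp.url)
# http://example.com/search?q=question&lang=en&tag=python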
Based on this answer, one-liner for simple cases (Python 3 code):
from urllib.parse import urlparse, urlencode
url = "https://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url += ('&' if urlparse(url).query else '?') + urlencode(params)
or:
url += ('&', '?')[urlparse(url).query == ''] + urlencode(params)
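One caveat worth noting: both forms append blindly, so a key that already exists in the URL is repeated rather than replaced:

url = "https://stackoverflow.com/search?q=question"
url += ('&' if urlparse(url).query else '?') + urlencode({'q': 'other'})
# https://stackoverflow.com/search?q=question&q=other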
I find this more elegant than the two top answers:
from urllib.parse import urlencode, urlparse, parse_qs

def merge_url_query_params(url: str, additional_params: dict) -> str:
    url_components = urlparse(url)
    original_params = parse_qs(url_components.query)
    # Before Python 3.5 you could update original_params with
    # additional_params, but here all the variables are immutable.
    merged_params = {**original_params, **additional_params}
    updated_query = urlencode(merged_params, doseq=True)
    # _replace() is how you can create a new NamedTuple with a changed field
    return url_components._replace(query=updated_query).geturl()

assert merge_url_query_params(
    'http://example.com/search?q=question',
    {'lang':'en','tag':'python'},
) == 'http://example.com/search?q=question&lang=en&tag=python'
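Because parse_qs collects repeated keys into lists and urlencode(..., doseq=True) expands them back out, repeated parameters survive the merge; for example:

assert merge_url_query_params(
    'http://example.com/?tag=a&tag=b',
    {'lang': 'en'},
) == 'http://example.com/?tag=a&tag=b&lang=en'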
The most important things I dislike in the top answers (they are nevertheless good):
Łukasz: having to remember the index at which the query is in the URL components
Sapphire64: the very verbose way of creating the updated ParseResult
What's bad about my response is the magic-looking dict merge using unpacking, but I prefer that to updating an existing dictionary because of my prejudice against mutability.
Yes: use urllib.
From the examples in the documentation:
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.geturl() # Prints the final URL with parameters.
>>> print f.read() # Prints the contents
Python 3, self-explanatory I guess:
from urllib.parse import urlparse, urlencode, parse_qsl
url = 'https://www.linkedin.com/jobs/search?keywords=engineer'
parsed = urlparse(url)
current_params = dict(parse_qsl(parsed.query))
new_params = {'location': 'United States'}
merged_params = urlencode({**current_params, **new_params})
parsed = parsed._replace(query=merged_params)
print(parsed.geturl())
# https://www.linkedin.com/jobs/search?keywords=engineer&location=United+States
I liked Łukasz's version, but since the urllib and urlparse functions are somewhat awkward to use in this case, I think it's more straightforward to do something like this:
params = urllib.urlencode(params)
if urlparse.urlparse(url)[4]:
    print url + '&' + params
else:
    print url + '?' + params
Use the various urlparse functions to tear apart the existing URL, urllib.urlencode() on the combined dictionary, then urlparse.urlunparse() to put it all back together again.
Or just take the result of urllib.urlencode() and concatenate it to the URL appropriately.
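A minimal sketch of that approach (written with the Python 3 names; on Python 2 the same functions live in urlparse and urllib, and with_params is just an illustrative helper name):

from urllib.parse import urlparse, urlencode, parse_qsl

def with_params(url, extra):
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.update(extra)
    return parts._replace(query=urlencode(query)).geturl()

with_params('http://example.com/search?q=question', {'lang': 'en'})
# 'http://example.com/search?q=question&lang=en'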
Yet another answer:
def addGetParameters(url, newParams):
    (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
    queryList = urlparse.parse_qsl(query, keep_blank_values=True)
    for key in newParams:
        queryList.append((key, newParams[key]))
    return urlparse.urlunparse((scheme, netloc, path, params, urllib.urlencode(queryList), fragment))
In Python 2.5:
import cgi
import urllib
import urlparse
def add_url_param(url, **params):
    n = 3
    parts = list(urlparse.urlsplit(url))
    d = dict(cgi.parse_qsl(parts[n]))  # use cgi.parse_qs for list values
    d.update(params)
    parts[n] = urllib.urlencode(d)
    return urlparse.urlunsplit(parts)
url = "http://stackoverflow.com/search?q=question"
add_url_param(url, lang='en') == "http://stackoverflow.com/search?q=question&lang=en"
Here is how I implemented it.
import urllib
params = urllib.urlencode({'lang':'en','tag':'python'})
url = ''
if request.GET:
    url = request.url + '&' + params
else:
    url = request.url + '?' + params
Worked like a charm. However, I would have liked a cleaner way to implement this.
Another way of implementing the above is to put it in a method.
import urllib
def add_url_param(request, **params):
    new_url = ''
    _params = dict(**params)
    _params = urllib.urlencode(_params)

    if _params:
        if request.GET:
            new_url = request.url + '&' + _params
        else:
            new_url = request.url + '?' + _params
    else:
        new_url = request.url

    return new_url
Related
I want to build a function that reads a URL from a txt file, saves it to a variable, and then adds some values inside the URL between other values.
Example of the URL: https://domains.livedns.co.il/API/DomainsAPI.asmx/NewDomain?UserName=apidemo#livedns.co.il&Password=demo
Let's say I want to inject some values between UserName and Password, save it into the file again, and use it later.
I started to write the function and played with the urllib parser, but I still don't understand how to do it.
What I've tried so far:
def dlastpurchase():
    if os.path.isfile("livednsurl.txt"):
        apikeyfile = open("livednsurl.txt", "r")
        apikey = apikeyfile.read()
        url_parse = urlsplit(apikey)
        print(url_parse.geturl())

dlastpurchase()
Thanks in advance for any tips and help.
Here is a slightly more complex example that I believe you will find interesting and will enjoy improving (while it takes care of some scenarios, it might be lacking in others). It's also written functionally, to enable reuse in other cases. Here we go.
Assume we have a text file named 'urls.txt' that contains this URL:
https://domains.livedns.co.il/API/DomainsAPI.asmx/NewDomain?UserName=apidemo#livedns.co.il&Password=demo
from os import error
from urllib.parse import urlparse, parse_qs, urlunparse
filename = 'urls.txt'
A function to parse the URL and return its query parameters, as well as the parsed URL object, which will be used to reconstruct the URL later on:
def parse_url(url):
    """parse a given url and return its query parameters

    Args:
        url (string): url string to parse

    Returns:
        parsed (tuple): the tuple object returned by urlparse
        query_parameters (dictionary): dictionary containing the query parameters as keys
    """
    try:
        # parse the url and get the query parameters from there
        parsed = urlparse(url)
        # parse the queries and return the dictionary containing them
        query_result = parse_qs(parsed.query)
        return (query_result, parsed)
    except(error):
        print('something failed !!!')
        print(error)
        return False
A function to add a new query parameter or to replace an existing one:
def insert_or_replace_word(query_dic, word, value):
    """Insert a value for the query parameters of a url

    Args:
        query_dic (object): the dictionary containing the query parameters
        word (string): the query parameter to replace or insert values for
        value (string): the value to insert or use as replacement

    Returns:
        result (string): the result of the insertion or replacement
    """
    try:
        query_dic[word] = value
        return query_dic
    except (error):
        print('Something went wrong {0}'.format(error))
A function to format the query parameters and get them ready to reconstruct the new URL:
def format_query_strings(query_dic):
    """format the final query dictionary ready to be used to construct a new url

    Args:
        query_dic (dictionary): final query dictionary after insertion or update
    """
    final_string = ''
    for key, value in query_dic.items():
        # unfortunately, query params from parse_qs come back as lists, so unwrap them before creating the final string
        if type(value) == list:
            query_string = '{0}={1}'.format(key, value[0])
            final_string += '{0}&'.format(query_string)
        else:
            query_string = '{0}={1}'.format(key, value)
            final_string += '{0}&'.format(query_string)
    # this is to remove any extra & inserted at the end of the loop above
    if final_string.endswith('&'):
        final_string = final_string[:len(final_string)-1]
    return final_string
We check that everything works by reading in the text file, performing the above operations, and then saving the new URL to a new file:
with open(filename) as url:
    lines = url.readlines()
    for line in lines:
        query_params, parsed = parse_url(line)
        new_query_dic = insert_or_replace_word(query_params, 'UserName', 'newUsername')
        final = format_query_strings(new_query_dic)
        # here you have to pass an iterable of length 6 in order to reconstruct the url
        new_url_object = [parsed.scheme, parsed.netloc, parsed.path, parsed.params, final, parsed.fragment]
        # this reconstructs the new url
        new_url = urlunparse(new_url_object)
        # create a new file and append the link inside of it
        with open('new_urls.txt', 'a') as new_file:
            new_file.writelines(new_url)
            new_file.write('\n')
You don't have to use fancy tools to do this. Just split the URL on the "?" character, then split the second part on "&". Add your new params to the resulting list, and merge them back onto the base URL.
url = "https://domains.livedns.co.il/API/DomainsAPI.asmx/NewDomain?UserName=apidemo#livedns.co.il&Password=demo"
base, params = url.split("?")
params = params.split("&")
params.insert(2, "new_user=yololo&new_passwd=hololo")
for param in params:
base += param + "&"
base = base.strip("&")
print(base)
I did it like this since you asked for inserting at a specific location. But URL params do not depend on order, so you could just append to the end of the URL instead. Or you can edit the parameters from the list I showed.
Is there a way to parse a URL (with some python library) and return a python dictionary with the keys and values of a query parameters part of the URL?
For example:
url = "http://www.example.org/default.html?ct=32&op=92&item=98"
expected return:
{'ct':32, 'op':92, 'item':98}
Use the urllib.parse library:
>>> from urllib import parse
>>> url = "http://www.example.org/default.html?ct=32&op=92&item=98"
>>> parse.urlsplit(url)
SplitResult(scheme='http', netloc='www.example.org', path='/default.html', query='ct=32&op=92&item=98', fragment='')
>>> parse.parse_qs(parse.urlsplit(url).query)
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> dict(parse.parse_qsl(parse.urlsplit(url).query))
{'item': '98', 'op': '92', 'ct': '32'}
The urllib.parse.parse_qs() and urllib.parse.parse_qsl() methods parse out query strings, taking into account that keys can occur more than once and that order may matter.
If you are still on Python 2, urllib.parse was called urlparse.
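For reference, the equivalent session on Python 2.6+ (only the module name changes):

>>> import urlparse
>>> dict(urlparse.parse_qsl(urlparse.urlsplit(url).query))
{'item': '98', 'op': '92', 'ct': '32'}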
For Python 3, the values of the dict from parse_qs are in a list, because there might be multiple values. If you just want the first one:
>>> from urllib.parse import urlsplit, parse_qs
>>>
>>> url = "http://www.example.org/default.html?ct=32&op=92&item=98"
>>> query = urlsplit(url).query
>>> params = parse_qs(query)
>>> params
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> dict(params)
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> {k: v[0] for k, v in params.items()}
{'item': '98', 'op': '92', 'ct': '32'}
If you prefer not to use a parser:
url = "http://www.example.org/default.html?ct=32&op=92&item=98"
url = url.split("?")[1]
dict = {x[0] : x[1] for x in [x.split("=") for x in url[1:].split("&") ]}
So I won't delete what's above but it's definitely not what you should use.
I think I read a few of the answers and they looked a little complicated; in case you're like me, don't use my solution above.
Use this:
from urllib import parse
params = dict(parse.parse_qsl(parse.urlsplit(url).query))
and for Python 2.X
import urlparse as parse
params = dict(parse.parse_qsl(parse.urlsplit(url).query))
I know this is the same as the accepted answer, just in a one liner that can be copied.
For Python 2.7:
In [14]: url = "http://www.example.org/default.html?ct=32&op=92&item=98"
In [15]: from urlparse import urlparse, parse_qsl
In [16]: parse_url = urlparse(url)
In [17]: query_dict = dict(parse_qsl(parse_url.query))
In [18]: query_dict
Out[18]: {'ct': '32', 'item': '98', 'op': '92'}
I agree about not reinventing the wheel, but sometimes (while you're learning) it helps to build a wheel in order to understand a wheel. :) So, from a purely academic perspective, I offer this with the caveat that using a dictionary assumes that name-value pairs are unique (that the query string does not contain multiple records).
url = 'http://mypage.html?one=1&two=2&three=3'
page, query = url.split('?')
names_values_dict = dict(pair.split('=') for pair in query.split('&'))
names_values_list = [pair.split('=') for pair in query.split('&')]
I'm using version 3.6.5 in the IDLE IDE.
from urllib.parse import splitquery, parse_qs, parse_qsl
url = "http://www.example.org/default.html?ct=32&op=92&item=98&item=99"
splitquery(url)
# ('http://www.example.org/default.html', 'ct=32&op=92&item=98&item=99')
parse_qs(splitquery(url)[1])
# {'ct': ['32'], 'op': ['92'], 'item': ['98', '99']}
dict(parse_qsl(splitquery(url)[1]))
# {'ct': '32', 'op': '92', 'item': '99'}
# also works with url w/o query
parse_qs(splitquery("http://example.org")[1])
# {}
dict(parse_qsl(splitquery("http://example.org")[1]))
# {}
Old question; thought I'd chip in after I came across this splitquery thingy. Not sure about Python 2, since I don't use Python 2. splitquery is a bit more than re.split(r"\?", url, 1).
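Since splitquery is an undocumented helper (newer Python 3 releases deprecate it), the same results can be had with the supported API:

from urllib.parse import urlsplit, parse_qs, parse_qsl
parse_qs(urlsplit(url).query)
# {'ct': ['32'], 'op': ['92'], 'item': ['98', '99']}
dict(parse_qsl(urlsplit(url).query))
# {'ct': '32', 'op': '92', 'item': '99'}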
For Python 2.7, I am using the urlparse module to parse the URL query to a dict.
import urlparse
url = "http://www.example.org/default.html?ct=32&op=92&item=98"
print urlparse.parse_qs( urlparse.urlparse(url).query )
# result: {'item': ['98'], 'op': ['92'], 'ct': ['32']}
WSGI, python 2.7
Code
import sys
import json
import cgi
import urlparse
def application(environ, start_response):
    status = '200 OK'

    method = environ['REQUEST_METHOD']
    args = urlparse.parse_qs(environ['QUERY_STRING'])
    m = args['mesg']

    x = {
        "input": m[0],
        "result": m[0].capitalize()
    }
    # convert into JSON:
    y = json.dumps(x)

    output = y
    response_headers = [('Content-type', 'application/json'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    print (sys.version_info)
    return [output]
URL
http:///echo.py?mesg=hola
Response
{"input": "hola", "result": "Hola"}
You can easily parse a URL with a specific library.
Here is my simple code to parse it without any dedicated library.
(The input URL must contain a domain name, a protocol, and a path.)
def parseURL(url):
    seg2 = url.split('/')[2]  # Separating domain name
    seg1 = url.split(seg2)[-2]  # Deriving protocol
    print('Protocol:', seg1, '\n')
    print('Domain name:', seg2, '\n')
    seg3 = url.split(seg2)[1]  # Getting the path; if output is empty, then there is no path in URL
    print('Path:', seg3, '\n')
    if '#' in url:  # Extracting fragment id, else None
        seg4 = url.split('#')[1]
        print('Fragment ID:', seg4, '\n')
    else:
        seg4 = 'None'
    if '#' in url:  # Extracting user name, else None
        seg5 = url.split('/')[-1]
        print('Scheme with User Name:', seg5, '\n')
    else:
        seg5 = 'None'
    if '?' in url:  # Extracting query string, else None
        seg6 = url.split('?')[-1]
        print('Query string:', seg6, '\n')
    else:
        seg6 = 'None'
    print('**The dictionary is in the sequence: 0.URL 1.Protocol 2.Domain name 3.Path 4.Fragment id 5.User name 6.Query string** \n')
    dictionary = {'0.URL': url, '1.Protocol': seg1, '2.Domain name': seg2, '3.Path': seg3, '4.Fragment id': seg4,
                  '5.User name': seg5, '6.Query string': seg6}  # Printing required dictionary
    print(dictionary, '\n')
    print('The TLD in the given URL is following: ')
    if '.com' in url:  # Extracting most famous TLDs maintained by ICANN
        print('.com\n')
    elif '.de' in url:
        print('.de\n')
    elif '.uk' in url:
        print('.uk\n')
    elif 'gov' in url:
        print('gov\n')
    elif '.org' in url:
        print('.org\n')
    elif '.ru' in url:
        print('.ru\n')
    elif '.net' in url:
        print('.net\n')
    elif '.info' in url:
        print('.info\n')
    elif '.biz' in url:
        print('.biz\n')
    elif '.online' in url:
        print('.online\n')
    elif '.in' in url:
        print('.in\n')
    elif '.edu' in url:
        print('.edu\n')
    else:
        print('Other low level domain!\n')
    return dictionary

if __name__ == '__main__':
    url = input("Enter your URL: ")
    parseURL(url)

# Sample URLs to copy
# url='https://www.facebook.com/photo.php?fbid=2068026323275211&set=a.269104153167446&type=3&theater'
# url='http://www.blog.google.uk:1000/path/to/myfile.html?key1=value1&key2=value2#InTheDocument'
# url='https://www.overleaf.com/9565720ckjijuhzpbccsd#/347876331/'
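For comparison, a sketch of the same decomposition using the standard library (field names are those of urllib.parse; the homemade version above is useful mainly for learning):

from urllib.parse import urlparse

parts = urlparse('http://www.blog.google.uk:1000/path/to/myfile.html?key1=value1&key2=value2#InTheDocument')
print(parts.scheme)    # http
print(parts.netloc)    # www.blog.google.uk:1000
print(parts.path)      # /path/to/myfile.html
print(parts.query)     # key1=value1&key2=value2
print(parts.fragment)  # InTheDocument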
I want to split a URL into three strings.
Example:
https://www.google.com:443
http://amazon.com:467
I would like the output to be:
string 1: https or http
string 2: www.google.com or amazon.com
string 3: 443 or 467
The above output is based on the example provided. Basically I want to split the string into protocol, domain and port and assign to three different variables.
URLs are more complicated than one might think, which is why it's generally a good idea to use proven code to parse them and handle unexpected edge cases. Python has urllib.parse in the standard library, which you should use rather than trying to parse this yourself.
The parts you want are in the scheme, hostname, and port properties of the object returned from urlparse().
For example:
from urllib.parse import urlparse
def getParts(url_string):
    p = urlparse(url_string)
    return [p.scheme, p.hostname, p.port]
getParts('https://www.google.com:443')
# ['https', 'www.google.com', 443]
getParts('http://amazon.com:467')
# ['http', 'amazon.com', 467]
# surprising, but valid url:
getParts('https://en.wikipedia.org:443/wiki/Template:Welcome')
# ['https', 'en.wikipedia.org', 443]
# missing parts:
getParts('//www.google.com/example/home')
# ['', 'www.google.com', None]
Here you go (note this assumes the URL contains an explicit port; find/rfind will misbehave otherwise):
url = 'https://www.google.com:443'
first = url.find(':')
last = url.rfind(':')
protocol = url[:first]
domain = url[first+3:last]
port = url[last+1:]
A 'primitive' method:
from collections import namedtuple

def split_url(url):
    split_1 = url.split('://')
    split_2 = split_1[1].split(':')

    protocol = split_1[0]
    domain = split_2[0]
    port = split_2[1]

    url_split = namedtuple('url_split', ['protocol', 'domain', 'port'])
    return url_split(protocol, domain, port)
So, for example:
s = 'https://www.google.com:443'
result = split_url(s)
Then we have:
result.protocol
>> 'https'
result.domain
>> 'www.google.com'
result.port
>> '443'
During a unittest I would like to compare a generated URL with a static one defined in the test. For this comparison it would be good to have a TestCase.assertURLEqual or similar which would let you compare two URLs in string format and result in True if all query and fragment components were present and equal but not necessarily in order.
Before I go implement this myself, is this feature around already?
I don't know if there is something built-in, but you could simply use urlparse and check yourself for the query parameters since order is taken into account by default.
>>> import urlparse
>>> url1 = 'http://google.com/?a=1&b=2'
>>> url2 = 'http://google.com/?b=2&a=1'
>>> # parse url ignoring query params order
... def parse_url(url):
...     u = urlparse.urlparse(url)
...     q = u.query
...     u = urlparse.urlparse(u.geturl().replace(q, ''))
...     return (u, urlparse.parse_qs(q))
...
>>> parse_url(url1)
(ParseResult(scheme='http', netloc='google.com', path='/', params='', query='', fragment=''), {'a': ['1'], 'b': ['2']})
>>> def assert_url_equals(url1, url2):
...     return parse_url(url1) == parse_url(url2)
...
>>> assert_url_equals(url1, url2)
True
Well this is not too hard to implement with urlparse in the standard library:
from urlparse import urlparse, parse_qs
def urlEq(url1, url2):
pr1 = urlparse(url1)
pr2 = urlparse(url2)
return (pr1.scheme == pr2.scheme and
pr1.netloc == pr2.netloc and
pr1.path == pr2.path and
parse_qs(pr1.query) == parse_qs(pr2.query))
# Prints True
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=2&bar=1")
# Prints False
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=4&bar=1")
Basically, compare everything that is parsed from the URL but use parse_qs to get a dictionary from the query string.
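If you want this available as the assertURLEqual the question asked about, one way is to hang it off a TestCase mixin. A sketch in Python 3 spelling (URLAssertions is just an illustrative name):

import unittest
from urllib.parse import urlparse, parse_qs

class URLAssertions(unittest.TestCase):
    def assertURLEqual(self, url1, url2):
        pr1, pr2 = urlparse(url1), urlparse(url2)
        # non-query parts must match exactly; query order is normalized by parse_qs
        self.assertEqual((pr1.scheme, pr1.netloc, pr1.path), (pr2.scheme, pr2.netloc, pr2.path))
        self.assertEqual(parse_qs(pr1.query), parse_qs(pr2.query))
        self.assertEqual(pr1.fragment, pr2.fragment)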
For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so I could do like so:
print SubAddr
>> lol1
Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.
urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"] + lastIElements[1:])  # *.co.uk, *.uk, *
        exceptionCandidate = "!" + candidate
        # match tlds:
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:])
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]
    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
A very basic approach, without any sanity checking, could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?
What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
will do the job for you (it will return "www.my").
For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover the case that the FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two rightmost components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
tldextract separates the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
In addition, there are a number of examples below closely related to the usage of tldextract.extract.
First of all, import tldextract, as this splits the URL into its constituents: subdomain, domain, and suffix.
import tldextract
Then declare a variable (say ext) that stores the result of the query. We also have to provide it with the URL, in parentheses, in double quotes, as shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply try to run the ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
Then, if you want to use only the subdomain, domain, or suffix, use any of the code below, respectively.
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store only the subdomain in a variable, use the code below:
Sub_Domain = ext.subdomain
Then print Sub_Domain:
Sub_Domain
The result will be:
'lol1'
Using Python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]
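For the address above, this evaluates to 'lol1'.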
import re

def extract_domain(domain):
    domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain

def extract_subdomains(domain):
    subdomains = domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    domain = extract_domain(subdomains)
    subdomains = re.sub(r'\.?' + domain, '', subdomains)
    return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id
Standardize all domains to start with www. unless they have a subdomain.
from urllib.parse import urlparse

def has_subdomain(url):
    if len(url.split('.')) > 2:
        return True
    else:
        return False

domain = urlparse(url).netloc
if not has_subdomain(url):
    domain = 'www.' + domain
url = urlparse(url).scheme + '://' + domain