Canonical URL compare in Python?

Canonical URL compare in Python? - python

Are there any tools to do a URL compare in Python?
For example, if I have http://google.com and google.com/ I'd like to know that they are likely to be the same site.
If I were to construct a rule manually, I might Uppercase it, then strip off the http:// portion, and drop anything after the last alpha-numeric character.. But I can see failures of this, as I'm sure you can as well.
Is there a library that does this? How would you do it?

This off the top of my head:
def canonical_url(u):
u = u.lower()
if u.startswith("http://"):
u = u[7:]
if u.startswith("www."):
u = u[4:]
if u.endswith("/"):
u = u[:-1]
return u
def same_urls(u1, u2):
return canonical_url(u1) == canonical_url(u2)
Obviously, there's lots of room for more fiddling with this. Regexes might be better than startswith and endswith, but you get the idea.

There is quite a bit to creating a canonical url apparently.
The url-normalize library is best that I have tested.
Depending on the source of your urls you may wish to clean them of other standard parameters such as UTM codes. w3lib.url.url_query_cleaner is useful for this.
Combining this with Ned Batchelder's answer could look something like:
Code:
from w3lib.url import url_query_cleaner
from url_normalize import url_normalize
urls = ['google.com',
'google.com/',
'http://google.com/',
'http://google.com',
'http://google.com?',
'http://google.com/?',
'http://google.com//',
'http://google.com?utm_source=Google']
def canonical_url(u):
u = url_normalize(u)
u = url_query_cleaner(u,parameterlist = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content'],remove=True)
if u.startswith("http://"):
u = u[7:]
if u.startswith("https://"):
u = u[8:]
if u.startswith("www."):
u = u[4:]
if u.endswith("/"):
u = u[:-1]
return u
list(map(canonical_url,urls))
Result:
['google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com',
'google.com']

You could look up the names using dns and see if they point to the same ip. Some minor string processing may be required to remove confusing chars.
from socket import gethostbyname_ex
urls = ['http://google.com','google.com/','www.google.com/','news.google.com']
data = []
for orginalName in urls:
print 'url:',orginalName
name = orginalName.strip()
name = name.replace( 'http://','')
name = name.replace( 'http:','')
if name.find('/') > 0:
name = name[:name.find('/')]
if name.find('\\') > 0:
name = name[:name.find('\\')]
print 'dns lookup:', name
if name:
try:
result = gethostbyname_ex(name)
except:
continue # Unable to resolve
for ip in result[2]:
print 'ip:', ip
data.append( (ip, orginalName) )
print data
result:
url: http://google.com
dns lookup: google.com
ip: 66.102.11.104
url: google.com/
dns lookup: google.com
ip: 66.102.11.104
url: www.google.com/
dns lookup: www.google.com
ip: 66.102.11.104
url: news.google.com
dns lookup: news.google.com
ip: 66.102.11.104
[('66.102.11.104', 'http://google.com'), ('66.102.11.104', 'google.com/'), ('66.102.11.104', 'www.google.com/'), ('66.102.11.104', 'news.google.com')]

It's not 'fuzzy', it just find the 'distance' between two strings:
http://pypi.python.org/pypi/python-Levenshtein/
I would remove all portions which are semantically meaningful to URL parsing (protocol, slashes, etc.), normalize to lowercase, then perform a levenstein distance, then from there decide how many difference is an acceptable threshold.
Just an idea.

Related

Python - get TLD

I have a problem in function which should remove tld from domain. If domain has some subdomain it works correctly. For example:
Input: asdf.xyz.example.com
Output: asdf.xyz.example
Problem is when the domain has not any subdomain, there is dot in front of domain
Input: example.com
Output: .example
This is my code:
res = get_tld(domain, as_object=True, fail_silently=True, fix_protocol=True)
domain = '.'.join([res.subdomain, res.domain])
Function get_tld is from tld library
Could someone help me how to solve this problem?

With a very simple string manipulation, is this what you are looking for?
d1 = 'asdf.xyz.example.com'
output = '.'.join(d1.split('.')[:-1])
# output = 'asdf.xyz.example'
d2 = 'example.com'
output = '.'.join(d2.split('.')[:-1])
# output = 'example'

You can use filtering. It looks like get_tld works as intended but join is incorrect
domain = '.'.join(filter(lambda x: len(x), [res.subdomain, res.domain]))

another simple version is this:
def remove_tld(url):
*base, tld = url.split(".")
return ".".join(base)
url = "asdf.xyz.example.com"
print(remove_tld(url)) # asdf.xyz.example
url = "example.com"
print(remove_tld(url)) # example
*base, tld = url.split(".") puts the TLD in tld and everything else in base. then you just join tĥat with ".".join(base).

How can I split a URL into three different variables

I want to split a URL into three strings.
Example:
https://www.google.com:443
http://amazon.com:467
I would like the output to be:
string 1: https or http
string 2: www.google.com or amazon.com
string 3: 443 or 467
The above output is based on the example provided. Basically I want to split the string into protocol, domain and port and assign to three different variables.

ULRs are more complicated than one might think which is why it's generally a good idea to use proven code to parse them and handle unexpected edge cases. Python has urllib.parse in the library, which you should use rather than trying to parse this your self.
The parts you want are in the scheme, hostname, and port properties of the object returned from urlsparse()
For example:
from urllib.parse import urlparse
def getParts(url_string):
p = urlparse(url_string)
return [p.scheme, p.hostname, p.port]
getParts('https://www.google.com:443')
# ['https', 'www.google.com', 443]
getParts('http://amazon.com:467')
# ['http', 'amazon.com', 467]
# surprising, but valid url:
getParts('https://en.wikipedia.org:443/wiki/Template:Welcome')
# ['https', 'en.wikipedia.org', 443]
# missing parts:
getParts('//www.google.com/example/home')
# ['', 'www.google.com', None]

Here you go:
url = 'https://www.google.com:443'
first = url.find(':')
last = url.rfind(':')
protocol = url[:first]
domain = url[first+3:last]
port = url[last+1:]

A 'primitive' method:
from collections import namedtuple
def split_url(url):
split_1 = url.split('://')
split_2 = split_1[1].split(':')
protocol = split_1[0]
domain = split_2[0]
port = split_2[1]
url_split = namedtuple('url_split', ['protocol', 'domain', 'port'])
return url_split(protocol, domain, port)
So, for example:
s = 'https://www.google.com:443'
result = split_url(s)
Then we have:
result.protocol
>> 'https'
result.domain
>> 'www.google.com'
result.port
>> '443'

returning 'A' DNS record in dnspython

I am using dnspython to get the 'A' record and return the result (IP address for a given domain).
I have this simple testing python script:
import dns.resolver
def resolveDNS():
domain = "google.com"
resolver = dns.resolver.Resolver();
answer = resolver.query(domain , "A")
return answer
resultDNS = resolveDNS()
print resultDNS
However, the output is:
<dns.resolver.Answer object at 0x0000000004F56C50>
I need to get the result as a string. If it is an array of strings, how to return it?

The answer(s) you get is actually an iterator of 'A' records, so you'll need to iterate through those:
answers = resolver.query(domain, 'A')
for answer in answers:
print (answer.to_text())

import dns.resolver
def resolveDNS():
domain = "google.com"
resolver = dns.resolver.Resolver();
answer = resolver.query(domain , "A")
return answer
resultDNS = resolveDNS()
answer = ''
for item in resultDNS:
resultant_str = ','.join([str(item), answer])
print resultant_str
So now the resultant_str is a variable of type string that holds A records separated by comma.

Get subdomain from URL using Python

For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so i could do like so;
print SubAddr
>> lol1

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page"
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page"
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]
class DomainParts(object):
def __init__(self, domain_parts, tld):
self.domain = None
self.subdomains = None
self.tld = tld
if domain_parts:
self.domain = domain_parts[-1]
if len(domain_parts) > 1:
self.subdomains = domain_parts[:-1]
def get_domain_parts(url, tlds):
urlElements = urlparse(url).hostname.split('.')
# urlElements = ["abcde","co","uk"]
for i in range(-len(urlElements),0):
lastIElements = urlElements[i:]
# i=-3: ["abcde","co","uk"]
# i=-2: ["co","uk"]
# i=-1: ["uk"] etc
candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
exceptionCandidate = "!"+candidate
# match tlds:
if (exceptionCandidate in tlds):
return ".".join(urlElements[i:])
if (candidate in tlds or wildcardCandidate in tlds):
return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
# returns ["abcde"]
raise ValueError("Domain not in global list of TLDs")
domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk

A very basic approach, without any sanity checking could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?

What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")

For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover for the case that there FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two right most components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'

We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')

tldextract separate the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
In addition, there is a number of examples which is extremely related with the usage of tldextract.extract side.

First of All import tldextract, as this splits the URL into its constituents like: subdomain. domain, and suffix.
import tldextract
Then declare a variable (say ext) that stores the results of the query. We also have to provide it with the URL in parenthesis with double quotes. As shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply try to run ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
Then if you want to use only subdomain or domain or suffix, then use any of the below code, respectively.
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store the results only of subdomain in a variable, then use the code below:
Sub_Domain = ext.subdomain
Then Print Sub_Domain
Sub_Domain
The result will be:
'lol1'

Using python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]

import re
def extract_domain(domain):
domain = re.sub('http(s)?://|(\:|/)(.*)|','', domain)
matches = re.findall("([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
if matches:
return matches[0]
else:
return domain
def extract_subdomains(domain):
subdomains = domain = re.sub('http(s)?://|(\:|/)(.*)|','', domain)
domain = extract_domain(subdomains)
subdomains = re.sub('\.?'+domain,'', subdomains)
return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id

Standardize all domains to start with www. unless they have a subdomain.
from urllib.parse import urlparse
def has_subdomain(url):
if len(url.split('.')) > 2:
return True
else:
return False
domain = urlparse(url).netloc
if not has_subdomain(url):
domain_name = 'www.' + domain
url = urlparse(url).scheme + '://' + domain

Add params to given URL in Python

Suppose I was given a URL.
It might already have GET parameters (e.g. http://example.com/search?q=question) or it might not (e.g. http://example.com/).
And now I need to add some parameters to it like {'lang':'en','tag':'python'}. In the first case I'm going to have http://example.com/search?q=question&lang=en&tag=python and in the second — http://example.com/search?lang=en&tag=python.
Is there any standard way to do this?

There are a couple of quirks with the urllib and urlparse modules. Here's a working example:
try:
import urlparse
from urllib import urlencode
except: # For Python 3
import urllib.parse as urlparse
from urllib.parse import urlencode
url = "http://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
query.update(params)
url_parts[4] = urlencode(query)
print(urlparse.urlunparse(url_parts))
ParseResult, the result of urlparse(), is read-only and we need to convert it to a list before we can attempt to modify its data.

Outsource it to the battle tested requests library.
This is how I will do it:
from requests.models import PreparedRequest
url = 'http://example.com/search?q=question'
params = {'lang':'en','tag':'python'}
req = PreparedRequest()
req.prepare_url(url, params)
print(req.url)

Why
I've been not satisfied with all the solutions on this page (come on, where is our favorite copy-paste thing?) so I wrote my own based on answers here. It tries to be complete and more Pythonic. I've added a handler for dict and bool values in arguments to be more consumer-side (JS) friendly, but they are yet optional, you can drop them.
How it works
Test 1: Adding new arguments, handling Arrays and Bool values:
url = 'http://stackoverflow.com/test'
new_params = {'answers': False, 'data': ['some','values']}
add_url_params(url, new_params) == \
'http://stackoverflow.com/test?data=some&data=values&answers=false'
Test 2: Rewriting existing args, handling DICT values:
url = 'http://stackoverflow.com/test/?question=false'
new_params = {'question': {'__X__':'__Y__'}}
add_url_params(url, new_params) == \
'http://stackoverflow.com/test/?question=%7B%22__X__%22%3A+%22__Y__%22%7D'
Talk is cheap. Show me the code.
Code itself. I've tried to describe it in details:
from json import dumps
try:
from urllib import urlencode, unquote
from urlparse import urlparse, parse_qsl, ParseResult
except ImportError:
# Python 3 fallback
from urllib.parse import (
urlencode, unquote, urlparse, parse_qsl, ParseResult
)
def add_url_params(url, params):
""" Add GET params to provided URL being aware of existing.
:param url: string of target URL
:param params: dict containing requested params to be added
:return: string with updated URL
>> url = 'http://stackoverflow.com/test?answers=true'
>> new_params = {'answers': False, 'data': ['some','values']}
>> add_url_params(url, new_params)
'http://stackoverflow.com/test?data=some&data=values&answers=false'
"""
# Unquoting URL first so we don't loose existing args
url = unquote(url)
# Extracting url info
parsed_url = urlparse(url)
# Extracting URL arguments from parsed URL
get_args = parsed_url.query
# Converting URL arguments to dict
parsed_get_args = dict(parse_qsl(get_args))
# Merging URL arguments dict with new params
parsed_get_args.update(params)
# Bool and Dict values should be converted to json-friendly values
# you may throw this part away if you don't like it :)
parsed_get_args.update(
{k: dumps(v) for k, v in parsed_get_args.items()
if isinstance(v, (bool, dict))}
)
# Converting URL argument to proper query string
encoded_get_args = urlencode(parsed_get_args, doseq=True)
# Creating new parsed result object based on provided with new
# URL arguments. Same thing happens inside of urlparse.
new_url = ParseResult(
parsed_url.scheme, parsed_url.netloc, parsed_url.path,
parsed_url.params, encoded_get_args, parsed_url.fragment
).geturl()
return new_url
Please be aware that there may be some issues, if you'll find one please let me know and we will make this thing better

You want to use URL encoding if the strings can have arbitrary data (for example, characters such as ampersands, slashes, etc. will need to be encoded).
Check out urllib.urlencode:
>>> import urllib
>>> urllib.urlencode({'lang':'en','tag':'python'})
'lang=en&tag=python'
In python3:
from urllib import parse
parse.urlencode({'lang':'en','tag':'python'})

You can also use the furl module https://github.com/gruns/furl
>>> from furl import furl
>>> print furl('http://example.com/search?q=question').add({'lang':'en','tag':'python'}).url
http://example.com/search?q=question&lang=en&tag=python

If you are using the requests lib:
import requests
...
params = {'tag': 'python'}
requests.get(url, params=params)

Based on this answer, one-liner for simple cases (Python 3 code):
from urllib.parse import urlparse, urlencode
url = "https://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url += ('&' if urlparse(url).query else '?') + urlencode(params)
or:
url += ('&', '?')[urlparse(url).query == ''] + urlencode(params)

I find this more elegant than the two top answers:
from urllib.parse import urlencode, urlparse, parse_qs
def merge_url_query_params(url: str, additional_params: dict) -> str:
url_components = urlparse(url)
original_params = parse_qs(url_components.query)
# Before Python 3.5 you could update original_params with
# additional_params, but here all the variables are immutable.
merged_params = {**original_params, **additional_params}
updated_query = urlencode(merged_params, doseq=True)
# _replace() is how you can create a new NamedTuple with a changed field
return url_components._replace(query=updated_query).geturl()
assert merge_url_query_params(
'http://example.com/search?q=question',
{'lang':'en','tag':'python'},
) == 'http://example.com/search?q=question&lang=en&tag=python'
The most important things I dislike in the top answers (they are nevertheless good):
Łukasz: having to remember the index at which the query is in the URL components
Sapphire64: the very verbose way of creating the updated ParseResult
What's bad about my response is the magically looking dict merge using unpacking, but I prefer that to updating an already existing dictionary because of my prejudice against mutability.

Yes: use urllib.
From the examples in the documentation:
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.geturl() # Prints the final URL with parameters.
>>> print f.read() # Prints the contents

python3, self explanatory I guess
from urllib.parse import urlparse, urlencode, parse_qsl
url = 'https://www.linkedin.com/jobs/search?keywords=engineer'
parsed = urlparse(url)
current_params = dict(parse_qsl(parsed.query))
new_params = {'location': 'United States'}
merged_params = urlencode({**current_params, **new_params})
parsed = parsed._replace(query=merged_params)
print(parsed.geturl())
# https://www.linkedin.com/jobs/search?keywords=engineer&location=United+States

I liked Łukasz version, but since urllib and urllparse functions are somewhat awkward to use in this case, I think it's more straightforward to do something like this:
params = urllib.urlencode(params)
if urlparse.urlparse(url)[4]:
print url + '&' + params
else:
print url + '?' + params

Use the various urlparse functions to tear apart the existing URL, urllib.urlencode() on the combined dictionary, then urlparse.urlunparse() to put it all back together again.
Or just take the result of urllib.urlencode() and concatenate it to the URL appropriately.

Yet another answer:
def addGetParameters(url, newParams):
(scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
queryList = urlparse.parse_qsl(query, keep_blank_values=True)
for key in newParams:
queryList.append((key, newParams[key]))
return urlparse.urlunparse((scheme, netloc, path, params, urllib.urlencode(queryList), fragment))

In python 2.5
import cgi
import urllib
import urlparse
def add_url_param(url, **params):
n=3
parts = list(urlparse.urlsplit(url))
d = dict(cgi.parse_qsl(parts[n])) # use cgi.parse_qs for list values
d.update(params)
parts[n]=urllib.urlencode(d)
return urlparse.urlunsplit(parts)
url = "http://stackoverflow.com/search?q=question"
add_url_param(url, lang='en') == "http://stackoverflow.com/search?q=question&lang=en"

Here is how I implemented it.
import urllib
params = urllib.urlencode({'lang':'en','tag':'python'})
url = ''
if request.GET:
url = request.url + '&' + params
else:
url = request.url + '?' + params
Worked like a charm. However, I would have liked a more cleaner way to implement this.
Another way of implementing the above is put it in a method.
import urllib
def add_url_param(request, **params):
new_url = ''
_params = dict(**params)
_params = urllib.urlencode(_params)
if _params:
if request.GET:
new_url = request.url + '&' + _params
else:
new_url = request.url + '?' + _params
else:
new_url = request.url
return new_ur

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Canonical URL compare in Python? - python

Related

Python - get TLD

How can I split a URL into three different variables

returning 'A' DNS record in dnspython

Get subdomain from URL using Python

Add params to given URL in Python

Categories

Resources