During a unittest I would like to compare a generated URL with a static one defined in the test. For this comparison it would be good to have a TestCase.assertURLEqual or similar which would let you compare two URLs in string format and result in True if all query and fragment components were present and equal but not necessarily in order.
Before I go implement this myself, is this feature around already?
I don't know of anything built-in, but you could simply use urlparse and compare the query parameters yourself, since a plain string comparison takes their order into account.
>>> import urlparse
>>> url1 = 'http://google.com/?a=1&b=2'
>>> url2 = 'http://google.com/?b=2&a=1'
>>> # parse url ignoring query params order
... def parse_url(url):
...     u = urlparse.urlparse(url)
...     q = u.query
...     u = urlparse.urlparse(u.geturl().replace(q, ''))
...     return (u, urlparse.parse_qs(q))
...
>>> parse_url(url1)
(ParseResult(scheme='http', netloc='google.com', path='/', params='', query='', fragment=''), {'a': ['1'], 'b': ['2']})
>>> def assert_url_equals(url1, url2):
...     return parse_url(url1) == parse_url(url2)
...
>>> assert_url_equals(url1, url2)
True
Well this is not too hard to implement with urlparse in the standard library:
from urlparse import urlparse, parse_qs
def urlEq(url1, url2):
    pr1 = urlparse(url1)
    pr2 = urlparse(url2)
    return (pr1.scheme == pr2.scheme and
            pr1.netloc == pr2.netloc and
            pr1.path == pr2.path and
            parse_qs(pr1.query) == parse_qs(pr2.query))
# Prints True
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=2&bar=1")
# Prints False
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=4&bar=1")
Basically, compare everything that is parsed from the URL but use parse_qs to get a dictionary from the query string.
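If you want this as an actual unittest assertion for the original question, here is a minimal sketch (the mixin and method names are just placeholders; unittest itself has no built-in assertURLEqual):
import unittest
from urlparse import urlparse, parse_qs  # use urllib.parse on Python 3

class URLAssertionsMixin(object):
    def assertURLEqual(self, url1, url2):
        pr1, pr2 = urlparse(url1), urlparse(url2)
        # Query strings are compared as dicts, so parameter order is ignored.
        self.assertEqual(pr1.scheme, pr2.scheme)
        self.assertEqual(pr1.netloc, pr2.netloc)
        self.assertEqual(pr1.path, pr2.path)
        self.assertEqual(pr1.fragment, pr2.fragment)
        self.assertEqual(parse_qs(pr1.query), parse_qs(pr2.query))

class MyTest(URLAssertionsMixin, unittest.TestCase):
    def test_generated_url(self):
        self.assertURLEqual('http://foo.com/blah?bar=1&foo=2',
                            'http://foo.com/blah?foo=2&bar=1')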
How to ensure that the "plus" does not disappear in the QueryDict?
I am trying to parse the received get-query into a dict:
from django.http import QueryDict
from urllib.parse import quote_plus
my_non_safe_string = "test=1+1" # This example, the string can be anything. (in GET query format)
QueryDict(my_non_safe_string)
out: <QueryDict: {'test': ['1 1']}>
my_safe_string = quote_plus("test=1+1") # 'test%3D1%2B1'
QueryDict(my_safe_string)
out: <QueryDict: {'test=1+1': ['']}>
I would like to get the following result:
<QueryDict: {'test=1+1': ['1+1']}>
How about this?
In [1]: from django.http import QueryDict
In [2]: from urllib.parse import quote_plus
In [3]: key = quote_plus("test=1+1")
In [4]: value = quote_plus("1+1")
In [5]: query_str = f"{key}={value}"
In [6]: QueryDict(query_str)
Out[6]: <QueryDict: {'test=1+1': ['1+1']}>
You need to percent-encode only the plus; by using quote_plus on the whole string you also encode the equals sign (=), and therefore QueryDict can no longer parse it correctly:
my_safe_string = f'test={quote_plus("1+1")}'
this produces:
>>> from urllib.parse import quote_plus
>>> my_safe_string = f'test={quote_plus("1+1")}'
>>> QueryDict(my_safe_string)
<QueryDict: {'test': ['1+1']}>
If it is unclear if the key contains any characters that should be escaped, you can use:
key = 'test'
value = '1+1'
my_safe_string = f'{ quote_plus(key) }={ quote_plus(value) }'
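If both the keys and the values start out as separate Python strings, another option (just a sketch) is to build the whole query string with urlencode, which applies quote_plus to every key and value for you:
from urllib.parse import urlencode
from django.http import QueryDict

params = {'test': '1+1', 'other key': 'a&b=c'}
query_str = urlencode(params)  # keys and values are percent-encoded for you
QueryDict(query_str)
# <QueryDict: {'test': ['1+1'], 'other key': ['a&b=c']}>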
I have two inputs for my task:
>>> uri = u'/shop/amazonwow/getstates/1'
>>> uri_regex = u'/shop/(?P<shopid>.+)$/getstates/(?P<countryid>.+)$/'
Here uri is the request URL, and I am also passing a URI pattern (uri_regex) with it.
I need to fetch all dynamic data from uri. Which data is dynamic is decided by our uri_regex.
Example: here uri_regex has shopid and countryid as named group patterns, and the URL has the values amazonwow and 1 at the same positions.
My output should look like:
out = {'shopid': 'amazonwow', 'countryid': 1}
My Try :
>>> uri_list = uri.split('/')
>>> uri_list
[u'', u'shop', u'amazonwow', u'getstates', u'1']
>>> regex = uri_regex.split('/')
>>> regex
[u'', u'shop', u'(?P<shopid>.+)$', u'getstates', u'(?P<countryid>.+)$']
>>> out = {}
>>> for i in range(len(regex)):
...     if regex[i].startswith('(?') and regex[i].endswith(')$'):
...         key = regex[i][regex[i].find("<")+1:regex[i].find(">")]
...         out[key] = uri_list[i]
...
>>> print out
{u'shopid': u'amazonwow', u'countryid': u'1'}
>>>
Note: I tried this but I do not think it is a proper solution to the above problem. Please guide me if you have a better way.
import re

uri = u'/shop/amazonwow/getstates/1'
pattern = re.compile(u'shop/(.+)/getstates/(.+)')
out = {}
if pattern.search(uri):
    out['shopid'] = pattern.search(uri).groups()[0]
    out['countryid'] = pattern.search(uri).groups()[1]
Output:
out = {'countryid': '1', 'shopid': 'amazonwow'}
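Since the uri_regex in the question already uses named groups, another option (a sketch that assumes the stray $ anchors can be dropped from the pattern) is to match the whole URI at once and read groupdict():
import re

uri = u'/shop/amazonwow/getstates/1'
# the original uri_regex with the mid-pattern '$' anchors removed
uri_regex = u'/shop/(?P<shopid>[^/]+)/getstates/(?P<countryid>[^/]+)'

match = re.match(uri_regex, uri)
if match:
    out = match.groupdict()
    print(out)  # {'shopid': 'amazonwow', 'countryid': '1'} (key order may vary)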
My Try:
def fetch_uri_variables(uri, uri_regex):
    """
    Fetch the dynamic variables passed in uri as per the
    regular expression defined in uri_regex.
    """
    out, uri_list, uri_regex = {}, uri.split('/'), uri_regex.split('/')
    for pattern in range(len(uri_regex)):
        if re.search(r'^(\(\?)(.*)(\)\$)$', uri_regex[pattern]):
            out[re.search(r'\<(.*)\>', uri_regex[pattern]).group(1)] = \
                uri_list[pattern]
    return out
>>> uri
u'/testing/shop/amazonwow/getstates/1'
>>> uri_regex
u'/(?P<test>.+)$/shop/(?P<shopid>.+)$/getstates/(?P<countryid>.+)$/'
>>> fetch_uri_variables(uri, uri_regex)
{u'test': u'testing', u'countryid': u'1', u'shopid': u'amazonwow'}
>>>
For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so I could use it like so:
print SubAddr
>> lol1
Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page"
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page"
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.
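If you also need the port or path alongside the subdomain, one possible sketch is to combine tldextract with urlparse (the helper name split_url is just illustrative):
import tldextract
from urllib.parse import urlparse  # urlparse.urlparse on Python 2

def split_url(url):
    # returns (subdomain, registered domain, suffix, port, path)
    ext = tldextract.extract(url)
    parsed = urlparse(url)
    return ext.subdomain, ext.domain, ext.suffix, parsed.port, parsed.path

print(split_url("http://sub.lol1.domain.com:8888/some/page"))
# ('sub.lol1', 'domain', 'com', 8888, '/some/page')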
urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:])  # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate
        # match tlds:
        if exceptionCandidate in tlds:
            # exception rule: this candidate is not a public suffix itself
            return DomainParts(urlElements[:i+1], '.'.join(urlElements[i+1:]))
        if candidate in tlds or wildcardCandidate in tlds:
            # e.g. urlElements[:i] == ["abcde"]
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
A very basic approach, without any sanity checking could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?
What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")
For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover the case that the FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two rightmost components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
tldextract separates the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
In addition, here are a few more examples of how tldextract.extract can be used.
First of all, import tldextract, as this splits the URL into its constituents: subdomain, domain, and suffix.
import tldextract
Then declare a variable (say ext) that stores the result of the extraction, passing it the URL as shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply evaluate the ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
Then, if you want to use only the subdomain, domain, or suffix, use any of the code below, respectively.
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store only the subdomain in a variable, use the code below:
Sub_Domain = ext.subdomain
Then evaluate Sub_Domain:
Sub_Domain
The result will be:
'lol1'
Using python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]
import re

def extract_domain(domain):
    domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain

def extract_subdomains(domain):
    subdomains = domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    domain = extract_domain(subdomains)
    subdomains = re.sub(r'\.?' + domain, '', subdomains)
    return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id
Standardize all domains to start with www. unless they have a subdomain.
from urllib.parse import urlparse

def has_subdomain(url):
    if len(urlparse(url).netloc.split('.')) > 2:
        return True
    else:
        return False

# assuming `url` already holds the address to normalize
domain = urlparse(url).netloc
if not has_subdomain(url):
    domain = 'www.' + domain
url = urlparse(url).scheme + '://' + domain
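For example, wrapping the idea above into a small function (a sketch; ensure_www is just an illustrative name, and like the snippet above it only keeps scheme and host):
from urllib.parse import urlparse

def ensure_www(url):
    # prefix the host with 'www.' unless the URL already has a subdomain
    parsed = urlparse(url)
    netloc = parsed.netloc
    if len(netloc.split('.')) <= 2:  # no subdomain present
        netloc = 'www.' + netloc
    return parsed.scheme + '://' + netloc

print(ensure_www('http://domain.com/some/page'))   # http://www.domain.com
print(ensure_www('http://lol1.domain.com:8888/'))  # http://lol1.domain.com:8888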
I have the following Python code:
TRAC_REQUEST_LOCATION=""
TRAC_ENV=TRAC_ENV_PARENT+"/"+re.sub(r'^'+TRAC_REQUEST_LOCATION+'/([^/]+).*', r'\1', environ['REQUEST_URI'])
The content of environ['REQUEST_URI'] is something like /abc/DEF and I want to get only abc, but it doesn't work. It only works sometimes, but why?
Thanks for any advice.
EDIT:
Here is the new code consisting on the given answers:
def check_password(environ, user, password):
    global acct_mgr, TRAC_ENV
    TRAC_ENV = ''
    if 'REQUEST_URI' in environ:
        if '/' in environ['REQUEST_URI']:
            TRAC_ENV = environ['REQUEST_URI'].split('/')[1]
    else:
        return None
But I get as TRAC_ENV things like /abc/ or /abc, but I need only the abc part.
What is wrong with the code?
Why do you need a regexp? Use urlparse (Python 2.x, there is a link for Python 3.x in there).
If you want to extract the first part of the request path this is the simplest solution:
TRAC_ENV = ''
if '/' in environ['REQUEST_URI']:
    TRAC_ENV = environ['REQUEST_URI'].split('/')[1]
EDIT
An example usage:
>>> def trac_env(environ):
...     trac_env = ''
...     if '/' in environ['REQUEST_URI']:
...         trac_env = environ['REQUEST_URI'].split('/')[1]
...     return trac_env
...
>>> trac_env({'REQUEST_URI': ''})
''
>>> trac_env({'REQUEST_URI': '/'})
''
>>> trac_env({'REQUEST_URI': '/foo'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/bar'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/bar/'})
'foo'
I am back to my code above and it works fine now.
Perhaps the update of the components was the solution.
Suppose I was given a URL.
It might already have GET parameters (e.g. http://example.com/search?q=question) or it might not (e.g. http://example.com/).
And now I need to add some parameters to it like {'lang':'en','tag':'python'}. In the first case I'm going to have http://example.com/search?q=question&lang=en&tag=python and in the second — http://example.com/search?lang=en&tag=python.
Is there any standard way to do this?
There are a couple of quirks with the urllib and urlparse modules. Here's a working example:
try:
import urlparse
from urllib import urlencode
except ImportError:  # For Python 3
import urllib.parse as urlparse
from urllib.parse import urlencode
url = "http://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
query.update(params)
url_parts[4] = urlencode(query)
print(urlparse.urlunparse(url_parts))
ParseResult, the result of urlparse(), is read-only and we need to convert it to a list before we can attempt to modify its data.
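Since ParseResult is a namedtuple, a slightly shorter variation (a sketch) avoids the list conversion by using its _replace() method instead:
from urllib.parse import urlparse, urlencode, parse_qsl

url = "http://stackoverflow.com/search?q=question"
params = {'lang': 'en', 'tag': 'python'}

parts = urlparse(url)
query = dict(parse_qsl(parts.query))
query.update(params)
print(parts._replace(query=urlencode(query)).geturl())
# http://stackoverflow.com/search?q=question&lang=en&tag=python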
Outsource it to the battle tested requests library.
This is how I will do it:
from requests.models import PreparedRequest
url = 'http://example.com/search?q=question'
params = {'lang':'en','tag':'python'}
req = PreparedRequest()
req.prepare_url(url, params)
print(req.url)
Why
I've been not satisfied with all the solutions on this page (come on, where is our favorite copy-paste thing?) so I wrote my own based on answers here. It tries to be complete and more Pythonic. I've added a handler for dict and bool values in arguments to be more consumer-side (JS) friendly, but they are yet optional, you can drop them.
How it works
Test 1: Adding new arguments, handling Arrays and Bool values:
url = 'http://stackoverflow.com/test'
new_params = {'answers': False, 'data': ['some','values']}
add_url_params(url, new_params) == \
'http://stackoverflow.com/test?data=some&data=values&answers=false'
Test 2: Rewriting existing args, handling DICT values:
url = 'http://stackoverflow.com/test/?question=false'
new_params = {'question': {'__X__':'__Y__'}}
add_url_params(url, new_params) == \
'http://stackoverflow.com/test/?question=%7B%22__X__%22%3A+%22__Y__%22%7D'
Talk is cheap. Show me the code.
Code itself. I've tried to describe it in details:
from json import dumps

try:
    from urllib import urlencode, unquote
    from urlparse import urlparse, parse_qsl, ParseResult
except ImportError:
    # Python 3 fallback
    from urllib.parse import (
        urlencode, unquote, urlparse, parse_qsl, ParseResult
    )


def add_url_params(url, params):
    """ Add GET params to provided URL being aware of existing.

    :param url: string of target URL
    :param params: dict containing requested params to be added
    :return: string with updated URL

    >> url = 'http://stackoverflow.com/test?answers=true'
    >> new_params = {'answers': False, 'data': ['some','values']}
    >> add_url_params(url, new_params)
    'http://stackoverflow.com/test?data=some&data=values&answers=false'
    """
    # Unquoting URL first so we don't lose existing args
    url = unquote(url)
    # Extracting url info
    parsed_url = urlparse(url)
    # Extracting URL arguments from parsed URL
    get_args = parsed_url.query
    # Converting URL arguments to dict
    parsed_get_args = dict(parse_qsl(get_args))
    # Merging URL arguments dict with new params
    parsed_get_args.update(params)

    # Bool and Dict values should be converted to json-friendly values
    # you may throw this part away if you don't like it :)
    parsed_get_args.update(
        {k: dumps(v) for k, v in parsed_get_args.items()
         if isinstance(v, (bool, dict))}
    )

    # Converting URL argument to proper query string
    encoded_get_args = urlencode(parsed_get_args, doseq=True)
    # Creating new parsed result object based on provided with new
    # URL arguments. Same thing happens inside of urlparse.
    new_url = ParseResult(
        parsed_url.scheme, parsed_url.netloc, parsed_url.path,
        parsed_url.params, encoded_get_args, parsed_url.fragment
    ).geturl()

    return new_url
Please be aware that there may be some issues; if you find one, please let me know and we will make this thing better.
You want to use URL encoding if the strings can have arbitrary data (for example, characters such as ampersands, slashes, etc. will need to be encoded).
Check out urllib.urlencode:
>>> import urllib
>>> urllib.urlencode({'lang':'en','tag':'python'})
'lang=en&tag=python'
In python3:
from urllib import parse
parse.urlencode({'lang':'en','tag':'python'})
You can also use the furl module https://github.com/gruns/furl
>>> from furl import furl
>>> print furl('http://example.com/search?q=question').add({'lang':'en','tag':'python'}).url
http://example.com/search?q=question&lang=en&tag=python
If you are using the requests lib:
import requests
...
params = {'tag': 'python'}
requests.get(url, params=params)
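If you only want the resulting URL without actually sending a request, here is a sketch using the same library:
import requests

url = 'http://example.com/search?q=question'
params = {'lang': 'en', 'tag': 'python'}

# build the request without sending it, then inspect the merged URL
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
# http://example.com/search?q=question&lang=en&tag=python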
Based on this answer, one-liner for simple cases (Python 3 code):
from urllib.parse import urlparse, urlencode
url = "https://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}
url += ('&' if urlparse(url).query else '?') + urlencode(params)
or:
url += ('&', '?')[urlparse(url).query == ''] + urlencode(params)
I find this more elegant than the two top answers:
from urllib.parse import urlencode, urlparse, parse_qs
def merge_url_query_params(url: str, additional_params: dict) -> str:
    url_components = urlparse(url)
    original_params = parse_qs(url_components.query)
    # Before Python 3.5 you could update original_params with
    # additional_params, but here all the variables are immutable.
    merged_params = {**original_params, **additional_params}
    updated_query = urlencode(merged_params, doseq=True)
    # _replace() is how you can create a new NamedTuple with a changed field
    return url_components._replace(query=updated_query).geturl()

assert merge_url_query_params(
    'http://example.com/search?q=question',
    {'lang': 'en', 'tag': 'python'},
) == 'http://example.com/search?q=question&lang=en&tag=python'
The most important things I dislike in the top answers (they are nevertheless good):
Łukasz: having to remember the index at which the query is in the URL components
Sapphire64: the very verbose way of creating the updated ParseResult
What's bad about my response is the magic-looking dict merge using unpacking, but I prefer that to updating an existing dictionary because of my prejudice against mutability.
Yes: use urllib.
From the examples in the documentation:
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.geturl() # Prints the final URL with parameters.
>>> print f.read() # Prints the contents
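Note that this snippet uses the Python 2 API; a rough Python 3 equivalent (a sketch) would be:
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
with urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params) as f:
    print(f.geturl())  # the final URL with parameters
    print(f.read())    # the contents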
Python 3, self-explanatory I guess:
from urllib.parse import urlparse, urlencode, parse_qsl
url = 'https://www.linkedin.com/jobs/search?keywords=engineer'
parsed = urlparse(url)
current_params = dict(parse_qsl(parsed.query))
new_params = {'location': 'United States'}
merged_params = urlencode({**current_params, **new_params})
parsed = parsed._replace(query=merged_params)
print(parsed.geturl())
# https://www.linkedin.com/jobs/search?keywords=engineer&location=United+States
I liked Łukasz's version, but since the urllib and urlparse functions are somewhat awkward to use in this case, I think it's more straightforward to do something like this:
params = urllib.urlencode(params)
if urlparse.urlparse(url)[4]:
    print url + '&' + params
else:
    print url + '?' + params
Use the various urlparse functions to tear apart the existing URL, urllib.urlencode() on the combined dictionary, then urlparse.urlunparse() to put it all back together again.
Or just take the result of urllib.urlencode() and concatenate it to the URL appropriately.
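A minimal sketch of that first approach, using the Python 3 module layout (the helper name with_params is just illustrative):
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def with_params(url, extra):
    scheme, netloc, path, params, query, fragment = urlparse(url)
    combined = dict(parse_qsl(query))
    combined.update(extra)
    return urlunparse((scheme, netloc, path, params, urlencode(combined), fragment))

print(with_params('http://example.com/search?q=question', {'lang': 'en', 'tag': 'python'}))
# http://example.com/search?q=question&lang=en&tag=python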
Yet another answer:
import urllib
import urlparse

def addGetParameters(url, newParams):
    (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
    queryList = urlparse.parse_qsl(query, keep_blank_values=True)
    for key in newParams:
        queryList.append((key, newParams[key]))
    return urlparse.urlunparse((scheme, netloc, path, params, urllib.urlencode(queryList), fragment))
In python 2.5
import cgi
import urllib
import urlparse
def add_url_param(url, **params):
    n = 3
    parts = list(urlparse.urlsplit(url))
    d = dict(cgi.parse_qsl(parts[n]))  # use cgi.parse_qs for list values
    d.update(params)
    parts[n] = urllib.urlencode(d)
    return urlparse.urlunsplit(parts)
url = "http://stackoverflow.com/search?q=question"
add_url_param(url, lang='en') == "http://stackoverflow.com/search?q=question&lang=en"
Here is how I implemented it.
import urllib
params = urllib.urlencode({'lang':'en','tag':'python'})
url = ''
if request.GET:
    url = request.url + '&' + params
else:
    url = request.url + '?' + params
Worked like a charm. However, I would have liked a cleaner way to implement this.
Another way of implementing the above is put it in a method.
import urllib
def add_url_param(request, **params):
    new_url = ''
    _params = dict(**params)
    _params = urllib.urlencode(_params)
    if _params:
        if request.GET:
            new_url = request.url + '&' + _params
        else:
            new_url = request.url + '?' + _params
    else:
        new_url = request.url
    return new_url
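For example (a sketch; FakeRequest is just a stand-in for whatever framework request object exposes .GET and .url, as the function above assumes):
class FakeRequest(object):
    GET = {'q': 'question'}
    url = 'http://example.com/search?q=question'

print(add_url_param(FakeRequest(), lang='en', tag='python'))
# e.g. http://example.com/search?q=question&lang=en&tag=python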