Find dynamic values in a URL according to a defined regular expression - Python

I have two inputs for my task:
>>> uri = u'/shop/amazonwow/getstates/1'
>>> uri_regex = u'/shop/(?P<shopid>.+)$/getstates/(?P<countryid>.+)$/'
Here uri is the request URL, and I am also passing a URI pattern (uri_regex) with it.
I need to fetch all dynamic data from uri. Which parts are dynamic is decided by uri_regex.
Example: here uri_regex has shopid and countryid as named groups, and the URL has the values amazonwow and 1 at the same positions.
My output should look like this:
out = {'shopid': 'amazonwow', 'countryid': 1}
My try:
>>> uri_list = uri.split('/')
>>> uri_list
[u'', u'shop', u'amazonwow', u'getstates', u'1']
>>> regex = uri_regex.split('/')
>>> regex
[u'', u'shop', u'(?P<shopid>.+)$', u'getstates', u'(?P<countryid>.+)$', u'']
>>> out = {}
>>> for i in range(len(regex)):
...     if regex[i].startswith('(?') and regex[i].endswith(')$'):
...         key = regex[i][regex[i].find("<")+1:regex[i].find(">")]
...         out[key] = uri_list[i]
...
>>> print out
{u'shopid': u'amazonwow', u'countryid': u'1'}
Note: I tried this, but I do not think it is a proper solution to the problem above. Please point me to a better way if there is one.

import re
uri = u'/shop/amazonwow/getstates/1'
pattern = re.compile(u'shop/(.+)/getstates/(.+)')
out = {}
match = pattern.search(uri)  # search once and reuse the match object
if match:
    out['shopid'] = match.group(1)
    out['countryid'] = match.group(2)
Output:
out = {'countryid': '1', 'shopid': 'amazonwow'}

My try:
import re
def fetch_uri_variables(uri, uri_regex):
    """
    Fetch the dynamic variables passed in uri, as defined by the
    regular expression groups in uri_regex.
    """
    out, uri_list, uri_regex = {}, uri.split('/'), uri_regex.split('/')
    for pattern in range(len(uri_regex)):
        if re.search(r'^(\(\?)(.*)(\)\$)$', uri_regex[pattern]):
            out[re.search(r'\<(.*)\>', uri_regex[pattern]).group(1)] = \
                uri_list[pattern]
    return out
>>> uri
u'/testing/shop/amazonwow/getstates/1'
>>> uri_regex
u'/(?P<test>.+)$/shop/(?P<shopid>.+)$/getstates/(?P<countryid>.+)$/'
>>> fetch_uri_variables(uri, uri_regex)
{u'test': u'testing', u'countryid': u'1', u'shopid': u'amazonwow'}
>>>
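If the pattern can be adjusted a little, the named groups can do the work themselves, because match.groupdict() returns every named group in one call. Below is a minimal sketch, not a definitive implementation. It assumes that the `$` after each group is not meant as an end-of-string anchor and that a dynamic segment never contains a slash. The helper name extract_uri_variables is made up for illustration.

```python
import re

def extract_uri_variables(uri, uri_regex):
    """Return the named-group values of uri_regex found in uri (hypothetical helper)."""
    # Assumption: the "$" after each group is not a real anchor, so drop it.
    cleaned = uri_regex.replace(")$", ")").rstrip("/")
    # Assumption: a dynamic segment must not span a "/".
    cleaned = cleaned.replace(".+", "[^/]+")
    # Anchor the whole pattern once, allowing an optional trailing slash.
    match = re.match(cleaned + "/?$", uri)
    return match.groupdict() if match else {}

print(extract_uri_variables(
    "/shop/amazonwow/getstates/1",
    "/shop/(?P<shopid>.+)$/getstates/(?P<countryid>.+)$/",
))
```

All values come back as strings, so countryid is '1' rather than 1. Convert it with int() if a number is needed.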

Related

Python - get TLD

I have a problem with a function that should remove the TLD from a domain. If the domain has a subdomain, it works correctly. For example:
Input: asdf.xyz.example.com
Output: asdf.xyz.example
The problem is that when the domain has no subdomain, a dot appears in front of the domain:
Input: example.com
Output: .example
This is my code:
res = get_tld(domain, as_object=True, fail_silently=True, fix_protocol=True)
domain = '.'.join([res.subdomain, res.domain])
The get_tld function is from the tld library.
Could someone help me solve this problem?
With a very simple string manipulation, is this what you are looking for?
d1 = 'asdf.xyz.example.com'
output = '.'.join(d1.split('.')[:-1])
# output = 'asdf.xyz.example'
d2 = 'example.com'
output = '.'.join(d2.split('.')[:-1])
# output = 'example'
You can use filtering. It looks like get_tld works as intended, but the join is incorrect because it also joins the empty subdomain:
domain = '.'.join(filter(lambda x: len(x), [res.subdomain, res.domain]))
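To show why filtering removes the leading dot, here is a small sketch that runs without the tld library. The Result namedtuple is a hypothetical stand-in for the object returned by get_tld(..., as_object=True), and join_without_tld is a name invented for this example. Only the subdomain and domain attributes used in the question are modelled.

```python
from collections import namedtuple

# Hypothetical stand-in for the result object; only the two attributes
# used in the question are modelled here.
Result = namedtuple("Result", ["subdomain", "domain"])

def join_without_tld(res):
    # filter(None, ...) drops empty strings, so an empty subdomain
    # no longer contributes a leading dot.
    return ".".join(filter(None, [res.subdomain, res.domain]))

print(join_without_tld(Result("asdf.xyz", "example")))
print(join_without_tld(Result("", "example")))
```

The first call keeps the subdomain and the second returns the bare domain.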
Another simple version is this:
def remove_tld(url):
    *base, tld = url.split(".")
    return ".".join(base)
url = "asdf.xyz.example.com"
print(remove_tld(url)) # asdf.xyz.example
url = "example.com"
print(remove_tld(url)) # example
*base, tld = url.split(".") puts the TLD in tld and everything else in base. Then you just join that with ".".join(base).

Python regex to find functions and params pairs in js files

I am writing a JavaScript crawler application.
The application needs to open JavaScript files and find some specific code in order to do some stuff with them.
I am using regular expressions to find the code of interest.
Consider the following JavaScript code:
let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)]);
As you can see there is the st function which is called three times in the same line. The first two calls have an extra parameter named ctx but the third one doesn't have it.
What I need to do is to have 3 re matches as below:
Match 1
Group: function = "st('"
Group: string = "string1"
Group: ctx = "ctx1"
Match 2
Group: function = "st('"
Group: string = "string2"
Group: ctx = "ctx2"
Match 3
Group: function = "st('"
Group: string = "Found {0}"
Group: ctx = (None)
I am using regex101.com to test my patterns, and the pattern that comes closest to what I am looking for is the following:
(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))
You can see it in action here.
However, I have no idea how to make it return the ctx group the way I want it.
For your reference I am using the following Python code:
import re
matches = []
code = "let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], ctx = 'ctx1') : st('Found {0}', [st(this.param)], ctx = 'ctxparam'"
pattern = r"(?P<function>st\([\"'])(?P<string>.+?(?=[\"'](\s*,ctx\s*|\s*,\s*)))"
for m in re.compile(pattern).finditer(code):
    fnc = m.group('function')
    msg = m.group('string')
    ctx = m.group('ctx')  # fails for now: the pattern has no 'ctx' group yet
    idx = m.start()
    matches.append([idx, fnc, msg, ctx])
print(matches)
I have the feeling that re alone isn't capable of doing exactly what I am looking for, but any suggestion or solution that gets closer is more than welcome.
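One possible direction, offered as a rough sketch rather than a finished answer: let a regex find the start of each st('...') call, then walk forward while counting brackets so that only a ctx argument at the top level of that call is picked up. The sketch assumes ctx is always written as ctx = '...' and that the string literals contain no parentheses or square brackets. The function name find_st_calls is made up for this example.

```python
import re

# Start of a call: st( followed by a quoted string.
CALL = re.compile(r"""\bst\(\s*(["'])(?P<string>.*?)\1""")
# A ctx argument written as ctx = '...' or ctx = "...".
CTX = re.compile(r"""ctx\s*=\s*(["'])(?P<ctx>.*?)\1""")

def find_st_calls(code):
    results = []
    for m in CALL.finditer(code):
        depth = 1        # we start just inside "st("
        i = m.end()
        top_level = []   # characters that belong to this call only
        while i < len(code) and depth > 0:
            ch = code[i]
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
            elif depth == 1:
                top_level.append(ch)
            i += 1
        ctx = CTX.search("".join(top_level))
        results.append((m.start(), m.group("string"),
                        ctx.group("ctx") if ctx else None))
    return results

code = ("let nlabel = rs.length ? st('string1', [st('string2', ctx = 'ctx2')], "
        "ctx = 'ctx1') : st('Found {0}', [st(this.param)]);")
for idx, string, ctx in find_st_calls(code):
    print(idx, string, ctx)
```

Calls such as st(this.param) are skipped because the first argument is not a quoted string. Nested arguments are ignored when looking for ctx, which is how the outer call ends up with ctx1 instead of ctx2.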

Converting part of string into variable name in python

I have a file containing a text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know how I could extract variables like the ones below?
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
.
.
.
and so on
If the lines you want always start with upstream and server this should work:
app_dic = {}
with open('file.txt','r') as f:
    for line in f:
        if line.startswith('upstream'):
            app_i = line.split()[1]
            server_of_app_i = []
            for line in f:
                if not line.startswith('server'):
                    break
                server_of_app_i.append(line.split()[1][:-1])
            app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline character, as long as the file is not too large you could write it to a list and iterate over it:
app_dic = {}
with open('file.txt','r') as f:
    txt_iter = iter(f.read().split()) #iterator of list
    for word in txt_iter:
        if word == 'upstream':
            app_i = next(txt_iter)
            server_of_app_i=[]
            for word in txt_iter:
                if word == 'server':
                    server_of_app_i.append(next(txt_iter)[:-1])
                elif word == '}':
                    break
            app_dic[app_i] = server_of_app_i
This is uglier, as one has to search for the closing curly bracket to break out of the loop. If it gets any more complicated, regex should be used.
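As a rough idea of what the regex route could look like with only the standard library, here is a two-pass sketch: one pattern captures each upstream block, a second pulls the servers out of the block body. It assumes an upstream block never contains nested braces. The name parse_upstreams is invented for this example.

```python
import re

# Pass 1: "upstream <name> { ... }" up to the first closing brace.
BLOCK = re.compile(r"upstream\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
# Pass 2: "server <host:port>;" inside a block body.
SERVER = re.compile(r"server\s+([\d.:]+);")

def parse_upstreams(text):
    """Map each upstream name to its list of servers (hypothetical helper)."""
    return {name: SERVER.findall(body) for name, body in BLOCK.findall(text)}

text = """
loadbalancer {
    upstream application1 {
        server 127.0.0.1:8082;
        server 127.0.0.1:8083;
    }
    upstream application2 {
        server 127.0.0.1:8092;
    }
}
"""
print(parse_upstreams(text))
```

Because the patterns work on whitespace in general, the same function also handles input that has no newlines at all.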
If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re
rx = re.compile(r"""
(?:(?P<application>application\d)\s{\n| # "application" + digit + { + newline
(?!\A)\G\n) # assert that the next match starts here
server\s # match "server"
(?P<server>[\d.:]+); # followed by digits, . and :
""", re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}
for match in rx.finditer(string):
    if match.group('application'):
        current = match.group('application')
        result[current] = list()
    if current:
        result[current].append(match.group('server'))
print(result)
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G anchor, named capture groups and some programming logic.
This is the basic method:
import re
# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"
# Python patterns take no /.../g delimiters; findall already returns every match
listOfAll = re.findall(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}", objText)
for eachMatch in listOfAll:
    print("Here's one: %s" % eachMatch)
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.
I believe this as well can be solved with re:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> # s is assumed to hold the configuration text from the question
>>> scan = pat.scanner(s)
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or similarly with re.finditer method and without pat.scanner:
>>> d = defaultdict(list)  # start from an empty dict again
>>> for m in re.finditer(pat, s):
...     group = m.lastgroup
...     if group == 'APP':
...         keygroup = m.group(group)
...         continue
...     else:
...         d[keygroup].append(m.group(group))
...
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})

Looking for assertURLEquals

During a unittest I would like to compare a generated URL with a static one defined in the test. For this comparison it would be good to have a TestCase.assertURLEqual or similar which would let you compare two URLs in string format and result in True if all query and fragment components were present and equal but not necessarily in order.
Before I go implement this myself, is this feature around already?
I don't know if there is something built-in, but you could simply use urlparse and check yourself for the query parameters since order is taken into account by default.
>>> import urlparse
>>> url1 = 'http://google.com/?a=1&b=2'
>>> url2 = 'http://google.com/?b=2&a=1'
>>> # parse url ignoring query params order
... def parse_url(url):
...     u = urlparse.urlparse(url)
...     q = u.query
...     u = urlparse.urlparse(u.geturl().replace(q, ''))
...     return (u, urlparse.parse_qs(q))
...
>>> parse_url(url1)
(ParseResult(scheme='http', netloc='google.com', path='/', params='', query='', fragment=''), {'a': ['1'], 'b': ['2']})
>>> def assert_url_equals(url1, url2):
...     return parse_url(url1) == parse_url(url2)
...
>>> assert_url_equals(url1, url2)
True
Well this is not too hard to implement with urlparse in the standard library:
from urlparse import urlparse, parse_qs
def urlEq(url1, url2):
    pr1 = urlparse(url1)
    pr2 = urlparse(url2)
    return (pr1.scheme == pr2.scheme and
            pr1.netloc == pr2.netloc and
            pr1.path == pr2.path and
            parse_qs(pr1.query) == parse_qs(pr2.query))
# Prints True
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=2&bar=1")
# Prints False
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=4&bar=1")
Basically, compare everything that is parsed from the URL but use parse_qs to get a dictionary from the query string.
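For Python 3, where urlparse lives in urllib.parse, the same idea could be sketched as follows. This version also compares the fragment, which the question asks about. The names url_parts and urls_equal are invented for this example, and treating the fragment as a plain string is an assumption.

```python
from urllib.parse import urlsplit, parse_qs

def url_parts(url):
    """Split a URL into comparable pieces; the query becomes a dict."""
    parts = urlsplit(url)
    return (
        parts.scheme,
        parts.netloc,
        parts.path,
        # A dict comparison ignores the order of the query parameters.
        parse_qs(parts.query, keep_blank_values=True),
        parts.fragment,  # assumption: fragments are compared as plain strings
    )

def urls_equal(url1, url2):
    return url_parts(url1) == url_parts(url2)

print(urls_equal("http://foo.com/blah?bar=1&foo=2#top",
                 "http://foo.com/blah?foo=2&bar=1#top"))
print(urls_equal("http://foo.com/blah?bar=1&foo=2",
                 "http://foo.com/blah?foo=4&bar=1"))
```

Inside a unittest.TestCase, this can be wrapped as self.assertEqual(url_parts(a), url_parts(b)), which also gives a readable diff when the URLs differ.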

Get subdomain from URL using Python

For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so I could do something like this:
print SubAddr
>> lol1
Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.
urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]
class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]
def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate
        # match tlds:
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:])
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]
    raise ValueError("Domain not in global list of TLDs")
domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
A very basic approach, without any sanity checking could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?
What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")
For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover the case where the FQDN is longer. How you do this depends on your purposes. I might suggest stripping off the two rightmost components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
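The same "strip the two rightmost components" idea in Python 3 could look like the sketch below. The function name naive_subdomain is made up, and the approach rests on the assumption that the registered domain is always exactly two labels long, which is not true for suffixes such as co.uk.

```python
from urllib.parse import urlparse

def naive_subdomain(url):
    """Return everything left of the last two host labels (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the last two labels are the registered domain and the TLD.
    return ".".join(labels[:-2])

print(naive_subdomain("http://lol1.domain.com:8888/some/page"))
print(naive_subdomain("http://sub.lol1.domain.com:8888/some/page"))
```

For a host like forums.bbc.co.uk this returns forums.bbc, so a suffix-aware tool is the safer choice when such domains can occur.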
We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
tldextract separates the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
Here is a step-by-step example of using tldextract.extract.
First, import tldextract, which splits the URL into its constituents: subdomain, domain, and suffix.
import tldextract
Then declare a variable (say ext) that stores the result of the call. Pass the URL as a string inside the parentheses, as shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply evaluate the ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
If you want only the subdomain, domain, or suffix, use the corresponding attribute:
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store only the subdomain in a variable, use the code below:
Sub_Domain = ext.subdomain
Then print Sub_Domain:
Sub_Domain
The result will be:
'lol1'
Using python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]
import re
def extract_domain(domain):
    domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain
def extract_subdomains(domain):
    subdomains = domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    domain = extract_domain(subdomains)
    subdomains = re.sub(r'\.?' + domain, '', subdomains)
    return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id
Standardize all domains to start with www. unless they already have a subdomain.
from urllib.parse import urlparse
def has_subdomain(domain):
    # more than two dot-separated labels means a subdomain is present
    return len(domain.split('.')) > 2
# url is assumed to hold the address to standardize
domain = urlparse(url).netloc
if not has_subdomain(domain):
    domain = 'www.' + domain
url = urlparse(url).scheme + '://' + domain
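Rebuilding the URL from scheme and host alone drops the path and query. A variant that keeps them could be sketched like this. The function name add_www is invented, and it shares the assumption that a host with at most two labels has no subdomain.

```python
from urllib.parse import urlparse

def add_www(url):
    """Prefix the host with "www." when it has no subdomain (hypothetical helper)."""
    parts = urlparse(url)
    host = parts.netloc
    # Assumption: two labels or fewer means there is no subdomain.
    if len(host.split(".")) <= 2:
        host = "www." + host
    # _replace swaps only the host; path, query and fragment are kept.
    return parts._replace(netloc=host).geturl()

print(add_www("http://example.com/page"))
print(add_www("http://lol1.domain.com:8888/some/page"))
```

The first URL gains the www. prefix and the second is returned unchanged.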
