Confused about a simple regex with sub - python

I have the following Python code:
TRAC_REQUEST_LOCATION=""
TRAC_ENV=TRAC_ENV_PARENT+"/"+re.sub(r'^'+TRAC_REQUEST_LOCATION+'/([^/]+).*', r'\1', environ['REQUEST_URI'])
The content of environ['REQUEST_URI'] is something like /abc/DEF and I want to get only abc, but it doesn't work. It only works sometimes. Why?
Thanks for any advice.
EDIT:
Here is the new code, based on the given answers:
def check_password(environ, user, password):
    global acct_mgr, TRAC_ENV
    TRAC_ENV = ''
    if 'REQUEST_URI' in environ:
        if '/' in environ['REQUEST_URI']:
            TRAC_ENV = environ['REQUEST_URI'].split('/')[1]
    else:
        return None
But as TRAC_ENV I get things like /abc/ or /abc, and I need only the abc part.
What is wrong with the code?

Why do you need a regexp? Use urlparse (Python 2.x, there is a link for Python 3.x in there).
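For instance, a minimal sketch of the urlparse idea (written here for Python 3, where the module is urllib.parse; in Python 2 it is urlparse):

```python
from urllib.parse import urlparse

request_uri = '/abc/DEF?x=1'
path = urlparse(request_uri).path          # drops any query string
first_part = path.strip('/').split('/')[0]
print(first_part)  # abc
```

This avoids hand-written regexes entirely and still yields only the first path component.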

If you want to extract the first part of the request path, this is the simplest solution:
TRAC_ENV = ''
if '/' in environ['REQUEST_URI']:
    TRAC_ENV = environ['REQUEST_URI'].split('/')[1]
EDIT
An example usage:
>>> def trac_env(environ):
...     trac_env = ''
...     if '/' in environ['REQUEST_URI']:
...         trac_env = environ['REQUEST_URI'].split('/')[1]
...     return trac_env
...
>>> trac_env({'REQUEST_URI': ''})
''
>>> trac_env({'REQUEST_URI': '/'})
''
>>> trac_env({'REQUEST_URI': '/foo'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/bar'})
'foo'
>>> trac_env({'REQUEST_URI': '/foo/bar/'})
'foo'

I am back to my code above and it works fine now.
Perhaps the update of the components was the solution.

Related

Using regex on Charfield in Django

I have a model with
class dbf_att(models.Model):
    name = models.CharField(max_length=50, null=True)
And I'd like to check later that object.name matches some regex:
if re.compile(r'^\d+$').match(att.name):
    ret = 'Integer'
elif re.compile(r'^\d+\.\d+$').match(att.name):
    ret = 'Float'
else:
    ret = 'String'
return ret
This always returns 'String', even when some of the att.name values should match those regexes.
Thanks!
You can try RegexValidator.
Or you can do it with the package django-regex-field, but I would rather recommend the built-in solution; the fewer third-party apps, the better.
Regexes are great, but sometimes it is simpler and more readable to use other approaches. For example, how about just using built-in types to check the type:
try:
    att_name = float(att.name)
    ret = "Integer" if att_name.is_integer() else "Float"
except ValueError:
    ret = "String"
FYI, your regex code works perfectly fine. You might want to inspect the data that is being checked.
Demo:
>>> import re
>>> a = re.compile('^\d+$')
>>> b = re.compile('^\d+\.\d+$')
>>> a.match('10')
<_sre.SRE_Match object at 0x10fe7eb28>
>>> a.match('10.94')
>>> b.match('10')
>>> b.match('10.94')
<_sre.SRE_Match object at 0x10fe7eb90>
>>> a.match("string")
>>> b.match("string")
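If the regexes look right but still fail, one culprit worth checking (an assumption on my part, not something the question confirms) is stray whitespace around the stored values; the ^ and $ anchors will not tolerate it:

```python
import re

a = re.compile(r'^\d+$')

assert a.match('10') is not None          # clean value matches
assert a.match(' 10') is None             # leading space defeats the ^ anchor
assert a.match('10 ') is None             # trailing space defeats the $ anchor
assert a.match(' 10 '.strip()) is not None  # stripping first fixes it
```

So it may be worth calling .strip() on att.name before matching.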

Why lxml.etree.SubElement(body, "br") will create <br />?

I'm going through the lxml tutorial and I have a question:
Here is the code:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
#############LOOK!!!!!!!############
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
#############END####################
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
As you can see, in the marked block the instruction br = etree.SubElement(body, "br") creates only a self-closing <br/> tag. Why is that?
Is br a reserved word?
Thanks to someone's kind notification, I am publishing my answer here:
Look at this code first:
from lxml import etree

if __name__ == '__main__':
    print """Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>"""
    html_node = etree.Element("html")
    body_node = etree.SubElement(html_node, "body")
    body_node.text = "Hello"
    print "Step1:" + etree.tostring(html_node)
    br_node = etree.SubElement(body_node, "br")
    print "Step2:" + etree.tostring(html_node)
    br_node.tail = "World"
    print "Step3:" + etree.tostring(html_node)
    br_node.text = "Yeah?"
    print "Step4:" + etree.tostring(html_node)
Here is the output:
Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>
Step1:<html><body>Hello</body></html>
Step2:<html><body>Hello<br/></body></html>
Step3:<html><body>Hello<br/>World</body></html>
Step4:<html><body>Hello<br>Yeah?</br>World</body></html>
At first, what I was trying to figure out was:
Why is the output of br_node <br/> rather than <br></br>?
Compare step 3 and step 4, and the answer is quite clear:
If the element has no content, its output format is <name/>.
Due to the existing semantics of <br> in HTML, this easy question confused me for a long time.
Hope this post will help some guys like me.
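A minimal sketch of that serialization rule (my own illustration, not from the tutorial): lxml emits a self-closing tag only when an element has no text at all; even an empty string counts as content and forces paired tags:

```python
from lxml import etree

br = etree.Element("br")
assert etree.tostring(br) == b'<br/>'    # no content: serialized self-closing

br.text = ''                             # an empty string still counts as content
assert etree.tostring(br) == b'<br></br>'
```

So <br/> is nothing special about the name br; any childless, textless element is serialized that way.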

Looking for assertURLEquals

During a unittest I would like to compare a generated URL with a static one defined in the test. For this comparison it would be good to have a TestCase.assertURLEqual or similar which would let you compare two URLs in string format and result in True if all query and fragment components were present and equal but not necessarily in order.
Before I go implement this myself, is this feature around already?
I don't know if there is something built-in, but you could simply use urlparse and check yourself for the query parameters since order is taken into account by default.
>>> import urlparse
>>> url1 = 'http://google.com/?a=1&b=2'
>>> url2 = 'http://google.com/?b=2&a=1'
>>> # parse url ignoring query params order
... def parse_url(url):
...     u = urlparse.urlparse(url)
...     q = u.query
...     u = urlparse.urlparse(u.geturl().replace(q, ''))
...     return (u, urlparse.parse_qs(q))
...
>>> parse_url(url1)
(ParseResult(scheme='http', netloc='google.com', path='/', params='', query='', fragment=''), {'a': ['1'], 'b': ['2']})
>>> def assert_url_equals(url1, url2):
...     return parse_url(url1) == parse_url(url2)
...
>>> assert_url_equals(url1, url2)
True
Well this is not too hard to implement with urlparse in the standard library:
from urlparse import urlparse, parse_qs

def urlEq(url1, url2):
    pr1 = urlparse(url1)
    pr2 = urlparse(url2)
    return (pr1.scheme == pr2.scheme and
            pr1.netloc == pr2.netloc and
            pr1.path == pr2.path and
            parse_qs(pr1.query) == parse_qs(pr2.query))

# Prints True
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=2&bar=1")
# Prints False
print urlEq("http://foo.com/blah?bar=1&foo=2", "http://foo.com/blah?foo=4&bar=1")
Basically, compare everything that is parsed from the URL but use parse_qs to get a dictionary from the query string.
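The same idea translated to Python 3 (urllib.parse instead of urlparse) and packaged as a reusable mixin; the mixin and the assertURLEqual name are my own sketch, not an existing unittest API:

```python
import unittest
from urllib.parse import urlparse, parse_qs

class URLAssertMixin:
    def assertURLEqual(self, url1, url2):
        pr1, pr2 = urlparse(url1), urlparse(url2)
        # Compare everything except the query verbatim...
        self.assertEqual(pr1._replace(query=''), pr2._replace(query=''))
        # ...and compare the query as a dict, so parameter order is ignored.
        self.assertEqual(parse_qs(pr1.query), parse_qs(pr2.query))

class ExampleTest(URLAssertMixin, unittest.TestCase):
    def test_equal(self):
        self.assertURLEqual('http://foo.com/blah?bar=1&foo=2',
                            'http://foo.com/blah?foo=2&bar=1')
```

Failures then come with unittest's usual diff output, which is the main advantage over a plain boolean helper.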

Get subdomain from URL using Python

For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so i could do like so;
print SubAddr
>> lol1
Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.
urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.
import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        # i=-3: ["abcde","co","uk"]
        # i=-2: ["co","uk"]
        # i=-1: ["uk"] etc
        candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"] + lastIElements[1:])  # *.co.uk, *.uk, *
        exceptionCandidate = "!" + candidate
        # match tlds:
        if exceptionCandidate in tlds:
            return ".".join(urlElements[i:])
        if candidate in tlds or wildcardCandidate in tlds:
            # domain_parts here is ["abcde"]
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
A very basic approach, without any sanity checking, could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?
What you are looking for is in:
http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")
For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover the case that the FQDN could be longer. How you do this depends on your purposes. I might suggest stripping off the two rightmost components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
tldextract separates the TLD from the registered domain and subdomains of a URL.
Installation
pip install tldextract
For the current question:
import tldextract
address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)
The output:
Extracted domain name : domain
In addition, there are a number of examples closely related to the usage of tldextract.extract.
First of all, import tldextract, as this splits the URL into its constituents: subdomain, domain, and suffix.
import tldextract
Then declare a variable (say ext) that stores the result of the query. We also have to provide it with the URL in parentheses, in double quotes, as shown below:
ext = tldextract.extract("http://lol1.domain.com:8888/some/page")
If we simply evaluate the ext variable, the output will be:
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
Then, if you want to use only the subdomain, domain, or suffix, use any of the code below, respectively.
ext.subdomain
The result will be:
'lol1'
ext.domain
The result will be:
'domain'
ext.suffix
The result will be:
'com'
Also, if you want to store only the subdomain result in a variable, use the code below:
Sub_Domain = ext.subdomain
Then print Sub_Domain:
Sub_Domain
The result will be:
'lol1'
Using Python 3 (I'm using 3.9 to be specific), you can do the following:
from urllib.parse import urlparse
address = 'http://lol1.domain.com:8888/some/page'
url = urlparse(address)
url.hostname.split('.')[0]
import re

def extract_domain(domain):
    domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain

def extract_subdomains(domain):
    subdomains = domain = re.sub(r'http(s)?://|(\:|/)(.*)|', '', domain)
    domain = extract_domain(subdomains)
    subdomains = re.sub(r'\.?' + domain, '', subdomains)
    return subdomains
Example to fetch subdomains:
print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))
Outputs:
lol1
kota-tangerang
Example to fetch domain
print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))
Outputs:
domain.com
kpu.go.id
Standardize all domains to start with www. unless they have a subdomain.
from urllib.parse import urlparse

def has_subdomain(url):
    if len(url.split('.')) > 2:
        return True
    else:
        return False

domain = urlparse(url).netloc
if not has_subdomain(url):
    domain = 'www.' + domain
url = urlparse(url).scheme + '://' + domain
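That fragment can be wrapped into a self-contained helper; the name ensure_www and the decision to count dots in the netloc rather than the whole URL are my own assumptions:

```python
from urllib.parse import urlparse

def ensure_www(url):
    # Prefix the host with "www." unless it already has a subdomain.
    parts = urlparse(url)
    host = parts.netloc
    # Count dots in the host itself, not the full URL, so that
    # path components such as "/a.b" are not miscounted as subdomains.
    if len(host.split('.')) <= 2:
        host = 'www.' + host
    return parts._replace(netloc=host).geturl()

print(ensure_www('http://google.com/some/page'))  # http://www.google.com/some/page
print(ensure_www('http://lol1.domain.com/page'))  # unchanged: already has a subdomain
```

Using _replace on the ParseResult namedtuple keeps the scheme, path, query, and fragment intact instead of rebuilding the URL by string concatenation.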

Access untranslated content of Django's ugettext_lazy

I'm looking for a sane way to get to the untranslated content of a ugettext_lazyied string. I found two ways, but I'm not happy with either one:
the_string = ugettext_lazy('the content')
the_content = the_string._proxy____args[0] # ewww!
or
from django.utils.translation import activate, get_language
from django.utils.encoding import force_unicode
the_string = ugettext_lazy('the content')
current_lang = get_language()
activate('en')
the_content = force_unicode(the_string)
activate(current_lang)
The first piece of code accesses an attribute that has been explicitly marked as private, so there is no telling how long this code will work. The second solution is overly verbose and slow.
Of course, in the actual code, the definition of the ugettext_lazyied string and the code that accesses it are miles apart.
This is a better version of your second solution:
from django.utils import translation
the_string = ugettext_lazy('the content')
with translation.override('en'):
    content = unicode(the_string)
Here are another two options. Not very elegant, but they don't touch private API and are not slow.
Number one, define your own ugettext_lazy:
from django.utils import translation

def ugettext_lazy(str):
    t = translation.ugettext_lazy(str)
    t.message = str
    return t
>>> text = ugettext_lazy('Yes')
>>> text.message
"Yes"
>>> activate('lt')
>>> unicode(text)
u"Taip"
>>> activate('en')
>>> unicode(text)
u"Yes"
Number two: redesign your code. Define untranslated messages separately from where you use them:
gettext = lambda s: s
some_text = gettext('Some text')
lazy_translated = ugettext_lazy(some_text)
untranslated = some_text
You can do that (but you shouldn't):
the_string = ugettext_lazy('the content')
the_string._proxy____args[0]
