I'm creating a web application in Python (with Flask) where a user can register with their desired username. I would like to show their profile at /user/ and have a directory on the server for each user.
What is the best way to make sure the username is safe for both a URL and a directory name? I read about people using the urlsafe methods in base64, but I would like to have a string that is related to their username for easy recognition.
The generic term for such URL-safe values is "slug", and the process of generating one is called "slugification", or "to slugify". People generally use a regular expression to do so; here is one (sourced from this article on the subject), using only stdlib modules:
import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.]+')

def slugify(text, delim='-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        # Decompose accented characters, then drop anything that is not ASCII.
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)
The linked article has another 2 alternatives requiring additional modules.
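For the directory half of the question, here is a minimal Python 3 sketch of how a slug could be turned into a per-user path. It is not from the linked article: the helper name `user_dir` and the base path are hypothetical, and the slug rule is a simplified one (keep only ASCII letters and digits). Note that two different usernames can produce the same slug, so you would still want to check for collisions at registration time.

```python
import re
import unicodedata
from pathlib import Path


def slugify(text, delim="-"):
    """Reduce text to lowercase ASCII letters and digits joined by delim."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return delim.join(re.findall(r"[a-z0-9]+", ascii_text.lower()))


def user_dir(base, username):
    """Hypothetical helper: build the directory path for a username."""
    slug = slugify(username)
    if not slug:
        # Names made only of punctuation would otherwise map to the base dir.
        raise ValueError("username contains no usable characters")
    return Path(base) / slug


print(slugify("Jörg Müller!"))            # jorg-muller
print(user_dir("/srv/users", "A B"))
```

Because the slug can only contain letters, digits and the delimiter, inputs such as "../etc/passwd" cannot escape the base directory.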
I am working on a huge email-address dataset in Python and need to retrieve the organization name.
For example, email@organizationName.com is easy to extract, but what about email@info.organizationName.com or even email@organizationName.co.uk?
I need a universal extractor that should be able to handle all different possibilities accordingly.
If the organization name always comes directly before .com or another single-part ending, this may work:
email_str.split('@')[1].split('.')[-2]
A regex won't work well here. To do this reliably, you need a library that knows what constitutes a valid suffix.
Otherwise, how would the extractor be able to distinguish email@info.organizationName.com from email@organizationName.co.uk?
This can be done using tldextract:
Example:
import tldextract

emails = ['email@organizationName.com',
          'email@info.organizationName.com',
          'email@organizationName.co.uk',
          'email@info.organizationName.co.uk',
          ]

for addr in emails:
    print(tldextract.extract(addr))
Output:
ExtractResult(subdomain='', domain='organizationName', suffix='com')
ExtractResult(subdomain='info', domain='organizationName', suffix='com')
ExtractResult(subdomain='', domain='organizationName', suffix='co.uk')
ExtractResult(subdomain='info', domain='organizationName', suffix='co.uk')
To access just the domain, use tldextract.extract(addr).domain.
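To illustrate why knowledge of valid suffixes matters, here is a stdlib-only sketch. It uses a tiny hand-written suffix set, which is purely an assumption for the example; a real solution should rely on the full Public Suffix List, which is what tldextract does for you. The function name `organization` is hypothetical.

```python
# Illustrative only: a real suffix list has thousands of entries.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}


def organization(email):
    """Return the label directly before the longest known suffix, or None."""
    host = email.rpartition("@")[2].lower()
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return labels[i - 1]
    return None


for addr in ["email@organizationName.com",
             "email@info.organizationName.co.uk"]:
    print(addr, "->", organization(addr))
```

Without "co.uk" in the set, the second address would wrongly yield "co", which is exactly the case the naive split cannot handle.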
I would like to use python to convert all synonyms and plural forms of words to the base version of the word.
e.g. Babies would become baby and so would infant and infants.
I tried creating a naive version of plural to root code but it has the issue that it doesn't always function correctly and can't detect a large amount of cases.
contents = ["buying", "stalls", "responsibilities"]
for token in contents:
    if token.endswith("ies"):
        token = token.replace('ies','y')
    elif token.endswith('s'):
        token = token[:-1]
    elif token.endswith("ed"):
        token = token[:-2]
    elif token.endswith("ing"):
        token = token[:-3]
print(contents)
I have not used this library before, so take this with a grain of salt. However, NodeBox Linguistics seems to be a reasonable set of scripts that will do exactly what you are looking for if you are on MacOS. Check the link here: https://www.nodebox.net/code/index.php/Linguistics
Based on their documentation, it looks like you will be able to use lines like so:
print( en.noun.singular("people") )
>>> person
print( en.verb.infinitive("swimming") )
>>> swim
etc.
In addition to the example above, another option to consider is a natural language processing library like NLTK. The reason I recommend using an external library is that English has a lot of exceptions. As mentioned in my comment, consider words like class, fling, red, and geese, which would trip up the rules that were mentioned in the original question.
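To make the point about exceptions concrete, here is a small sketch that applies suffix rules like the ones from the question to those four words. The function name `naive_root` is made up for the illustration, and the rules are a cleaned-up version of the question's code rather than a recommended approach.

```python
def naive_root(token):
    """Strip common suffixes with fixed rules (deliberately simplistic)."""
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s"):
        return token[:-1]
    if token.endswith("ed"):
        return token[:-2]
    if token.endswith("ing"):
        return token[:-3]
    return token


for word in ["class", "fling", "red", "geese"]:
    print(word, "->", naive_root(word))
# class -> clas
# fling -> fl
# red -> r
# geese -> geese
```

None of the four results is a usable base form, which is why a dictionary-backed library is the safer choice.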
I built a Python library, Plurals and Countable, which is open source on GitHub. Its main purpose is to get plurals (yes, multiple plurals for some words), but it also solves this particular problem.
import plurals_counterable as pluc
pluc.pluc_lookup_plurals('men', strict_level='dictionary')
will return a dictionary of the following.
{
    'query': 'men',
    'base': 'man',
    'plural': ['men'],
    'countable': 'countable'
}
The base field is what you need.
The library actually looks up the words in dictionaries, so it takes some time to request, parse and return. Alternatively, you might use the REST API provided by Dictionary.video. You'll need to contact admin@dictionary.video to get an API key. The call will look like this:
import logging

import requests

url = 'https://dictionary.video/api/noun/plurals/men?key=YOUR_API_KEY'

# Wrapped in a function (the name get_base is arbitrary) so that `return` is valid.
def get_base(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()['base']
    logging.error('%s response: status_code[%d]', url, response.status_code)
    return None
I have an application in which the main strings are in English and then various translations are made in various .po/.mo files, as usual (using Flask and Flask-Babel). Is it possible to get a list of all the English strings somewhere within my Python code? Specifically, I'd like to have an admin interface on the website which lets someone log in and choose an arbitrary phrase to be used in a certain place without having to poke at actual Python code or .po/.mo files. This phrase might change over time but needs to be translated, so it needs to be something Babel knows about.
I do have access to the actual .pot file, so I could just parse that, but I was hoping for a cleaner method if possible.
You can use polib for this.
This section of the documentation shows examples of how to iterate over the contents of a .po file. Here is one taken from that page:
import polib

po = polib.pofile('path/to/catalog.po')
for entry in po:
    print(entry.msgid, entry.msgstr)
If you already use Babel, you can get all items from a .po file:
from babel.messages.pofile import read_po

with open(full_file_name) as f:
    catalog = read_po(f)
for message in catalog:
    print(message.id, message.string)
See http://babel.edgewall.org/browser/trunk/babel/messages/pofile.py.
You can also try to get items from a .mo file:
from babel.messages.mofile import read_mo

# .mo files are binary, so open in 'rb' mode and use read_mo, not read_po.
with open(full_file_name, 'rb') as f:
    catalog = read_mo(f)
for message in catalog:
    print(message.id, message.string)
But the last time I tried to use it, it was not available. See http://babel.edgewall.org/browser/trunk/babel/messages/mofile.py.
You can use polib, as @Miguel wrote.
I'm writing a registration form that only needs to accept the local component of a desired email address. The domain component is fixed to the site. I am attempting to validate it by selectively copying from validators.validate_email which Django provides for EmailField:
email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    # quoted-string, see also http://tools.ietf.org/html/rfc2822#section-3.2.5
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'
    r')@((?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$)'  # domain
    r'|\[(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\]$', re.IGNORECASE)  # literal form, ipv4 address (SMTP 4.1.3)
validate_email = EmailValidator(email_re, _(u'Enter a valid e-mail address.'), 'invalid')
Following is my code. My main issue is that I'm unable to adapt the regex. At this point I'm only testing it in a regex tester at http://www.pythonregex.com/ however it's failing:
^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$
This seems to be passing undesirable characters such as ?
The entire code for my Field, which is not necessarily relevant at this stage but I wouldn't mind some comment on it would be:
class LocalEmailField(CharField):
    email_local_re = re.compile(r"^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$", re.IGNORECASE)
    validate_email_local = RegexValidator(email_re, (u'Enter a valid e-mail username.'), 'invalid')
    default_validators = [validate_email_local]
EDIT: To clarify, the user is only entering the text BEFORE the @, hence why I have no need to validate the @domain.com in the validator.
EDIT 2: So the form field and label will look like this:
Desired Email Address: [---type-able area---] @domain.com
You say "undesirable characters such as ?", but I think you're mistaken about what characters are desirable. The original regex allows question marks.
Note that you can also define your own validator that doesn't use a massive regex, and have some chance of decoding the logic later.
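As a rough sketch of what such a validator could look like, here is a plain function that mirrors the dot-atom part of the regex above: one or more non-empty atoms separated by single dots. The name `validate_local_part` is hypothetical, and it raises ValueError only to keep the example self-contained; in Django you would raise django.core.exceptions.ValidationError instead and put the function in the field's validators.

```python
import string

# Characters allowed in a dot-atom, matching the character class in the regex.
ALLOWED = set(string.ascii_letters + string.digits + "!#$%&'*+/=?^_`{}|~-")


def validate_local_part(value):
    """Check the part of an address before the @ against dot-atom rules."""
    if not value:
        raise ValueError("The local part must not be empty.")
    for atom in value.split("."):
        if not atom:
            raise ValueError("Dots may not be leading, trailing or doubled.")
        bad = set(atom) - ALLOWED
        if bad:
            raise ValueError("Invalid characters: %s" % "".join(sorted(bad)))
    return value
```

To be more conservative than the RFC, shrink ALLOWED; the logic stays readable either way.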
Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
Checking via regex is an exercise in wasting your time. The best way is to attempt delivery; this way not only can you verify the email address, but also if the mailbox is actually active and can receive emails.
Otherwise you'll end up with an ever-expanding regular expression that can't possibly hope to match all the rules.
"Haha boo hoo woo woo!"@foo.com is a valid address, and so is qwerterukeriouo@gmail.com
Instead, offer the almost-standard "Please click on the link in the email we sent to blahblah@goo.com to verify your address." approach.
If you want to create email addresses, then you can write your own rules on what can be a part of the email component; and they can be a subset of the official allowed chars in the RFC.
For example, a conservative rule (that doesn't use regular expressions):
import string

allowed_chars = string.digits + string.ascii_letters + '-'
# Reject the input if any character falls outside the allowed set.
if any(c not in allowed_chars for c in user_input):
    print('Sorry, invalid characters')
elif user_input[0] in string.digits + '-':
    print('Cannot start with a number or `-`')
elif check_if_already_exists(user_input):
    print('Sorry, already taken')
else:
    print('Congratulations!')
I'm still new to Django and Python, but why reinvent the wheel and maintain your own regex? If, apart from wanting users to enter only the local portion of their email address, you're happy with Django's built-in EmailField, you can subclass it quite easily and tweak the validation logic a bit:
DOMAIN_NAME = u'foo.com'

# forms.EmailField is used here because its clean() takes just the value.
class LocalEmailField(forms.EmailField):
    def clean(self, local_part):
        whole_address = '%s@%s' % (local_part, DOMAIN_NAME)
        clean_address = super(LocalEmailField, self).clean(whole_address)
        # Can do more checking here if necessary
        clean_local, at_sign, clean_domain = clean_address.rpartition('@')
        return clean_local
Have you looked at the documentation for Form and Field Validation and the .clean() method?
If you want to do it 100% correctly with regex, you need to use an engine with some form of extended regex which allow matching nested parentheses.
Python's default engine does not allow this, so you're better off compromising with a very simple (permissive) regex.
I'm looking to create a search function for my flash game website.
One of the problems with the site is that it is difficult to find a specific game you want, as users must go to the alphabetical list to find one they want.
It's run with Google App Engine written in python, using the webapp framework.
At the very least I need a simple way to search games by their name. It might be easier to do searching in Javascript from the looks of it. I would prefer an autocomplete functionality. I've tried to figure out how to go about this and it seems that the only way is to create a huge index with each name broken up into various stages of being typed ("S", "Sh", "Sho" ... "Shopping Cart Hero").
Is there any way to do this simply and easily? I'm beginning to think I'll have to create a web service on a PHP+MySQL server and search using it.
I have written the code below to handle this. Basically, I save all the possible word "starts" in a list instead of whole sentences. That's how the jquery autocomplete of this site works.
import unicodedata
import re

splitter = re.compile(r'[\s|\-|\)|\(|/]+')

def remove_accents(text):
    nkfd_form = unicodedata.normalize('NFKD', str(text))
    return "".join([c for c in nkfd_form if not unicodedata.combining(c)])

def get_words(text):
    return [s.lower() for s in splitter.split(remove_accents(text)) if s != '']

def get_unique_words(text):
    word_set = set(get_words(text))
    return word_set

def get_starts(text):
    # Collect every prefix of every word, e.g. "s", "sh", "sho", ...
    word_set = get_unique_words(text)
    starts = set()
    for word in word_set:
        for i in range(len(word)):
            starts.add(word[:i+1])
    return sorted(starts)
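Here is a rough, self-contained sketch of how such word starts could be used for lookups. It keeps everything in an in-memory dict, which is an assumption made for the example; on App Engine you would store the starts in a list property on each game entity and filter on it. The names `word_starts`, `build_index` and `search` are made up for the illustration.

```python
import re
import unicodedata
from collections import defaultdict


def word_starts(text):
    """Return every prefix of every word in text, lowercased, accents removed."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    words = [w for w in re.split(r"[\s\-()/]+", plain.lower()) if w]
    return {word[:i + 1] for word in words for i in range(len(word))}


def build_index(names):
    """Map each word start to the set of names containing such a word."""
    index = defaultdict(set)
    for name in names:
        for start in word_starts(name):
            index[start].add(name)
    return index


def search(index, query):
    """Look up a single typed fragment; returns matching names, sorted."""
    return sorted(index.get(query.strip().lower(), set()))


index = build_index(["Shopping Cart Hero", "Shift"])
print(search(index, "sh"))    # ['Shift', 'Shopping Cart Hero']
print(search(index, "cart"))  # ['Shopping Cart Hero']
```

This handles one typed word at a time; for multi-word queries you could intersect the result sets of each word.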
Have you looked at gae-search? I believe the Django + jQuery "autocomplete" feature is not part of the free version (it's just in the for-pay premium version), but maybe it's worth a little money to you.