I'm writing a registration form that only needs to accept the local component of a desired email address. The domain component is fixed to the site. I am attempting to validate it by selectively copying from validators.validate_email which Django provides for EmailField:
email_re = re.compile(
r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
# quoted-string, see also http://tools.ietf.org/html/rfc2822#section-3.2.5
r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'
r')@((?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$)' # domain
r'|\[(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\]$', re.IGNORECASE) # literal form, ipv4 address (SMTP 4.1.3)
validate_email = EmailValidator(email_re, _(u'Enter a valid e-mail address.'), 'invalid')
Following is my code. My main issue is that I'm unable to adapt the regex. At this point I'm only testing it in a regex tester at http://www.pythonregex.com/ however it's failing:
^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$
This seems to be passing undesirable characters such as ?
The entire code for my Field, which is not necessarily relevant at this stage but I wouldn't mind some comment on it would be:
class LocalEmailField(CharField):
    email_local_re = re.compile(r"^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$", re.IGNORECASE)
    validate_email_local = RegexValidator(email_local_re, (u'Enter a valid e-mail username.'), 'invalid')
    default_validators = [validate_email_local]
EDIT: To clarify, the user only enters the text BEFORE the @, which is why I have no need to validate the @domain.com part in the validator.
EDIT 2: So the form field and label will look like this:
Desired Email Address: [---type-able area---] @domain.com
You say "undesirable characters such as ?", but I think you're mistaken about what characters are desirable. The original regex allows question marks.
Note that you can also define your own validator that doesn't use a massive regex, and have some chance of decoding the logic later.
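As a minimal sketch of such a regex-free validator: the function name and the deliberately conservative character set below are assumptions for illustration, not Django's rules. In Django you would raise django.core.exceptions.ValidationError instead of ValueError and list the function in default_validators.

```python
import string

# Hypothetical conservative rule set; tighten or loosen to taste.
ALLOWED = set(string.ascii_letters + string.digits + "._-")

def validate_local_part(value):
    """Raise ValueError if value is not an acceptable local part."""
    if not value:
        raise ValueError("Username must not be empty.")
    if len(value) > 64:  # RFC 5321 caps the local part at 64 octets
        raise ValueError("Username is too long.")
    if not set(value) <= ALLOWED:
        raise ValueError("Username contains invalid characters.")
    if value[0] == "." or value[-1] == "." or ".." in value:
        raise ValueError("Dots may not lead, trail, or repeat.")
    return value
```

Each rule is one readable line, so adding or removing a rule later does not mean re-deriving a regex.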
Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
Checking via regex is an exercise in wasting your time. The best way is to attempt delivery; this way not only can you verify the email address, but also if the mailbox is actually active and can receive emails.
Otherwise you'll end up with an ever-expanding regular expression that can't possibly hope to match all the rules.
"Haha boo hoo woo woo!"@foo.com is a valid address, and so is qwerterukeriouo@gmail.com.
Instead, offer the almost-standard "Please click on the link in the email we sent to blahblah@goo.com to verify your address." approach.
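The click-the-link approach could be sketched roughly like this. The pending dict stands in for a database table, and the function names are made up for the example; sending the email itself is left out.

```python
import secrets

pending = {}  # token -> address; stand-in for a database table

def start_verification(address):
    """Create a one-time token to embed in the emailed link."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return token

def confirm(token):
    """Return the verified address, or None for an unknown or used token."""
    return pending.pop(token, None)
```

The address counts as verified only once confirm() returns it, which proves that the mailbox exists and that the user can read it.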
If you want to create email addresses, then you can write your own rules on what can be a part of the email component; and they can be a subset of the official allowed chars in the RFC.
For example, a conservative rule (that doesn't use regular expressions):
import string
allowed_chars = string.digits + string.ascii_letters + '-'
# Reject the input if any character falls outside the allowed set
if any(ch not in allowed_chars for ch in user_input):
    print('Sorry, invalid characters')
elif user_input[0] in string.digits + '-':
    print('Cannot start with a number or `-`')
elif check_if_already_exists(user_input):  # your own uniqueness lookup
    print('Sorry, already taken')
else:
    print('Congratulations!')
I'm still new to Django and Python, but why reinvent the wheel and maintain your own regex? If, apart from wanting users to enter only the local portion of their email address, you're happy with Django's built-in EmailField, you can subclass it quite easily and tweak the validation logic a bit:
DOMAIN_NAME = u'foo.com'
# forms.EmailField, since its clean() takes just the submitted value
class LocalEmailField(forms.EmailField):
    def clean(self, local_part):
        whole_address = '%s@%s' % (local_part, DOMAIN_NAME)
        clean_address = super(LocalEmailField, self).clean(whole_address)
        # Can do more checking here if necessary
        clean_local, at_sign, clean_domain = clean_address.rpartition('@')
        return clean_local
Have you looked at the documentation for Form and Field Validation and the .clean() method?
If you want to do it 100% correctly with regex, you need to use an engine with some form of extended regex that allows matching nested parentheses.
Python's default engine does not allow this, so you're better off compromising with a very simple (permissive) regex.
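One possible "simple and permissive" pattern, as a sketch only: it requires exactly one @, no whitespace, and at least one dot in the domain. The helper name is made up, and the pattern rejects some technically valid addresses (for example, quoted local parts containing spaces).

```python
import re

# Exactly one "@", no whitespace, at least one dot in the domain.
PERMISSIVE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(text):
    # fullmatch requires the whole string to fit the pattern
    return PERMISSIVE_EMAIL.fullmatch(text) is not None
```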
I'm trying to create a custom column type using django_tables2 so that I can render contact details as a mailto: link when the result is a valid email address, and just standard text otherwise.
The issue that I'm having is that my value seems to be returned as iterated characters: with the code below, the first character of the email address is rendered as part of the mailto: and the second character is rendered in the column. Aside from validate_email, I have tried if "@" in and a regex, all returning the same iterated-character results.
class ContactColumn(tables.Column):
    def render(self, value):
        try:
            validate_email(value)
            return format_html('''<a href="mailto:{}">{}</a>''', *value)
        except ValidationError:
            return value
Can anyone point me in the right direction as to how to successfully render either a mailto: link or just standard text based on valid email address? Any help is much appreciated!
The problem here is your *value argument.
The asterisk means to unpack a sequence (here, a string) into its parts (characters) and use those for the arguments. (Search for "Python argument unpacking" to learn more.)
Instead, just do:
format_html('''<a href="mailto:{}">{}</a>''', value, value)
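To see the unpacking effect in isolation, here is a small sketch using plain str.format, on the assumption that format_html fills its placeholders the same way. The sample address is made up.

```python
value = "ab@example.com"

# *value spreads the string into one argument per character;
# str.format uses the first two and silently ignores the rest.
unpacked = "{} | {}".format(*value)

# Passing the string twice fills both placeholders with the full value.
explicit = "{} | {}".format(value, value)

print(unpacked)
print(explicit)
```

This is why the first character ended up in the mailto: part and the second in the visible text.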
How can I validate a phone number and email address in an AWS Lex code hook (in Lambda)?
I tried using the following code to validate the phone number and email address in an AWS Lex chatbot, but I am getting errors.
import re
EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")
if len(str(phonenumber)) <= 10 or len(str(phonenumber)) >= 10:
    return build_validation_result(False,
                                   'PhoneNumber',
                                   'Please enter valid phone number which contains 10 digits'
                                   )
if not EMAIL_REGEX.match(email):
    return build_validation_result(False,
                                   'Email',
                                   'Please enter valid email address'
                                   )
Firstly, you will want to fix some of your formatting. Following the guide here will serve you well both to improve the readability of your code for yourself and others who you want help from or who need to maintain code later on.
Secondly, I am assuming you are omitting the vast majority of your code here, and that some of the errors in your indenting come from issues pasting to stackoverflow. I have fixed these errors, but if you are missing other important information regarding interacting with the aws api no one can help you until you post the code and ideally a full traceback of your error.
Not everyone might agree with me on this, but unless you are an expert with regular expressions, it is generally best to copy regex made by gurus and test it thoroughly to verify it produces your desired result rather than making one yourself. The regex I am using below was copied from here. I have tested it with a long list of valid emails I have and not one of them failed to match.
import re
PHONE_REGEX = re.compile(r'^[0-9]{10}$')  # anchored so that exactly 10 digits are required
EMAIL_REGEX = re.compile(r"""(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#'$"""+
r"""%&*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d"""+
r"""-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*"""+
r"""[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4]["""+
r"""0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|["""+
r"""a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|"""+
r"""\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""")
if not PHONE_REGEX.match(phonenumber):
    return build_validation_result(
        False,
        'PhoneNumber',
        'Please enter valid phone number which contains 10 digits'
    )
if not EMAIL_REGEX.match(email):
    return build_validation_result(
        False,
        'Email',
        'Please enter valid email address'
    )
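One detail worth checking when testing the phone pattern: re.match only anchors at the start of the string, so an unanchored ten-digit pattern also accepts longer digit strings. The sketch below uses fullmatch as one way to require exactly ten digits. The helper name is made up, and the str() call assumes the slot value might arrive as an int.

```python
import re

PHONE_REGEX = re.compile(r"[0-9]{10}")

def is_ten_digits(phonenumber):
    # str() guards against the value arriving as an int.
    return PHONE_REGEX.fullmatch(str(phonenumber)) is not None

# match() is satisfied by the first ten digits of an eleven-digit string.
prefix_only = PHONE_REGEX.match("01234567890") is not None
```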
I'm using the current regex to match email addresses in Python:
EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")
This works great, except for when I run it, I'm getting domains such as '.co.uk', etc. For my project, I am simply trying to get a count of international looking TLD's. I understand that this doesn't necessarily guarantee that my users are only US based, but it does give me a count of my users without internationally based TLD's (or what we would consider internationally based TLD's - .co.uk, .jp, etc).
What you want is very difficult.
If I make a mail server called this.is.my.email.my-domain.com, and an account called martin, my perfectly valid US email would be martin@this.is.my.email.my-domain.com. Emails with more than 1 domain part are not uncommon (.gov is a common example).
Disallowing emails from the .uk TLD is also problematic, since many US-based people might have a .uk address, for example they think it sounds nice, work for a UK based company, have a UK spouse, used to live in the UK and never changed emails, etc.
If you only want US-based registrations, your options are:
Ask your users if they are US-based, and tell them your service is only for US-based users if they answer with a non-US country.
Ask for a US address or phone number. Although this can be faked, it's not easy to get a matching address & ZIP code, for example.
Use GeoIP, and allow only US IP addresses. This is not foolproof, since people can use your service while on holiday abroad and such.
In the question's comments, you said:
Does it not make sense that if some one has a .jp TLD, or .co.uk, it stands to reason (with considerable accuracy) that they are internationally based?
Usually, yes. But far from always. My girlfriend has 4 .uk email addresses, and she doesn't live in the UK anymore :-) This is where you have to make a business choice, you can either:
Turn away potential customers
Take more effort in allowing customers with slightly "strange" email addresses
Your business, your choice ;-)
So, with that preamble, if you must do this, this is how you could do it:
import re
EMAIL_REGEX = re.compile(r'''
    ^         # Anchor to the start of the string
    [^@]+     # Username
    @         # Literal @
    ([^@.]+)  # One domain part
    \.        # Literal .
    ([^@.]+)  # One domain part (the TLD)
    $         # Anchor to the end of the string
''', re.VERBOSE)
print(EMAIL_REGEX.search('test@example.com'))
print(EMAIL_REGEX.search('test@example.co.uk'))
Of course, this still allows you to register with a .nl address, for example. If you want to allow only a certain set of TLD's, then use:
allowed_tlds = ['com', 'net'] # ... Probably more
result = EMAIL_REGEX.search('test@example.com')
if result is None or result.groups()[1] not in allowed_tlds:
    print('Not allowed')
However, if you're going to create a whitelist, then you don't need the regexp anymore, since not using it will allow US people with multi-domain addresses to sign up (such as @nlm.nih.gov).
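A regex-free allowlist check might look like the following sketch. The set contents and the function name are illustrative assumptions; only the last label of the domain is compared, so multi-part domains pass.

```python
ALLOWED_TLDS = {"com", "net", "gov"}  # illustrative; extend as needed

def tld_allowed(address):
    """True if the domain's last label is in the allowlist."""
    local, at, domain = address.rpartition("@")
    # Require a non-empty local part, an @, and a dotted domain.
    if not (local and at and "." in domain):
        return False
    return domain.rsplit(".", 1)[-1].lower() in ALLOWED_TLDS
```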
I'm creating a web application in Python (and Flask) where a user can register with their wanted username. I would like to show their profile at /user/ and have a directory on the server for each user.
What is the best way to make sure the username is secure for both a url and directory? I read about people using the urlsafe methods in base64, but I would like to have a string that is related to their username for easy recognition.
The generic term for such URL-safe values is "slug", and the process of generating one is called "slugification", or "to slugify". People generally use a regular expression to do so; here is one (sourced from this article on the subject), using only stdlib modules:
import re
from unicodedata import normalize
_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.]+')
def slugify(text, delim='-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        # Decompose accented characters, then drop anything non-ASCII
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)
The linked article has another 2 alternatives requiring additional modules.
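For the directory half of the question, one cautious option is to re-check the slug before joining it to a base path. This is a sketch: the pattern assumes the default '-' delimiter, and the function name and base path are made up.

```python
import re
from pathlib import Path

# Lowercase ASCII letters and digits, separated by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def user_dir(base, slug):
    """Return base/slug, refusing anything that is not a plain slug."""
    if not SLUG_RE.fullmatch(slug):
        raise ValueError("not a valid slug: %r" % slug)
    return Path(base) / slug
```

Note that different usernames can map to the same slug, so uniqueness still has to be checked on the slug itself.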
We have successfully implemented Encrypted Website Payments for PayPal in our Python+pyramid program, except for a tiny detail: input sanitization. Namely, we would like to help the user by providing as much data as possible to PayPal from our user database. Now, it occurred to me that a malicious user could change his name to 'Mr Hacker\nprice=0.00' or similar, and thus completely negate the security offered by EWP. I did try URL-encoding the values, but PayPal does not seem to decode the percent escapes in the file.
Our code is based on the django-paypal library; the library completely neglects this issue, outputting happily bare name=value pairs without any checks:
plaintext = 'cert_id=%s\n' % CERT_ID
for name, field in self.fields.iteritems():
    value = None
    if name in self.initial:
        value = self.initial[name]
    elif field.initial is not None:
        value = field.initial
    if value is not None:
        # ### Make this less hackish and put it in the widget.
        if name == "return_url":
            name = "return"
        plaintext += u'%s=%s\n' % (name, value)
plaintext = plaintext.encode('utf-8')
So, how does one properly format the input for dynamically encrypted buttons? Or is there a better way to achieve similar functionality in Website Payments Standard to avoid this problem, yet as secure?
Update
What we craft is a string with contents like
item_number=BASIC
p3=1
cmd=_xclick-subscriptions
business=business#business.com
src=1
item_name=Percent%20encoding%20and%20UTF-8:%20%C3%B6
charset=UTF-8
t3=M
a3=10.0
sra=1
cert_id=ABCDEFGHIJKLM
currency_code=EUR
and encrypt it for EWP; the user posts the form to https://www.sandbox.paypal.com/cgi-bin/webscr. When the user clicks on the button, the item name displayed on the PayPal page "Log in to complete your checkout" is "Percent%20encoding%20and%20UTF-8:%20%C3%B6". Thus, for EWP input it seems that percent encoding is not decoded.
You could filter out key-value pairs with regular expressions:
>>> import re
>>> text = 'Mr Hacker\nprice=0.00\nsecurity=false'
>>> re.sub(r'\n\S+=\S*', '', text)
'Mr Hacker'
Or even more simple, ditch everything after the first newline;
>>> text.splitlines()[0]
'Mr Hacker'
The latter assumes that the first line is correct, which might not be the case.
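A stricter alternative is to refuse any value that contains a line break rather than trying to clean it up. The sketch below only guards against "\n" and "\r"; whether PayPal treats other characters specially is not established here, and the function name is made up.

```python
def build_plaintext(pairs):
    """Join (name, value) pairs into name=value lines, rejecting line breaks."""
    lines = []
    for name, value in pairs:
        value = str(value)
        # A line break would let a value smuggle in an extra name=value pair.
        if "\n" in value or "\r" in value:
            raise ValueError("line break in value for %r" % name)
        lines.append("%s=%s" % (name, value))
    return "\n".join(lines) + "\n"
```

Rejecting the input surfaces the tampering attempt instead of silently passing along a truncated name.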