I am working on a huge email-address dataset in Python and need to retrieve the organization name.
For example, email@organizationName.com is easy to extract, but what about email@info.organizationName.com or even email@organizationName.co.uk?
I need a universal extractor that can handle all of these possibilities.
If the organization name always comes directly before .com (or another single-part ending), this may work:
email_str.split('@')[1].split('.')[-2]
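To see where the one-liner breaks, compare a single-part suffix with a two-part one (plain Python, no extra libraries):

```python
# The naive split works for a single-part suffix...
print('email@organizationName.com'.split('@')[1].split('.')[-2])    # organizationName
# ...but picks the wrong label when the suffix has two parts:
print('email@organizationName.co.uk'.split('@')[1].split('.')[-2])  # co, not organizationName
```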
A regex won't work well here. To do this reliably, you need a library that knows what constitutes a valid public suffix.
Otherwise, how would the extractor be able to distinguish email@info.organizationName.com from email@organizationName.co.uk?
This can be done using tldextract:
Example:
import tldextract

emails = ['email@organizationName.com',
          'email@info.organizationName.com',
          'email@organizationName.co.uk',
          'email@info.organizationName.co.uk',
          ]
for addr in emails:
    print(tldextract.extract(addr))
Output:
ExtractResult(subdomain='', domain='organizationName', suffix='com')
ExtractResult(subdomain='info', domain='organizationName', suffix='com')
ExtractResult(subdomain='', domain='organizationName', suffix='co.uk')
ExtractResult(subdomain='info', domain='organizationName', suffix='co.uk')
To access just the domain, use tldextract.extract(addr).domain.
Related
I'm using the following regex to match email addresses in Python:
EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")
This works great, except that when I run it, I get domains such as '.co.uk', etc. For my project, I am simply trying to get a count of international-looking TLDs. I understand that this doesn't necessarily guarantee that my users are only US-based, but it does give me a count of my users without internationally based TLDs (or what we would consider internationally based TLDs: .co.uk, .jp, etc.).
What you want is very difficult.
If I make a mail server called this.is.my.email.my-domain.com and an account called martin, my perfectly valid US email would be martin@this.is.my.email.my-domain.com. Email domains with more than two parts are not uncommon (.gov addresses are a common example).
Disallowing emails from the .uk TLD is also problematic, since many US-based people might have a .uk address: they think it sounds nice, work for a UK-based company, have a UK spouse, used to live in the UK and never changed emails, etc.
If you only want US-based registrations, your options are:
Ask your users if they are US-based, and tell them your service is only for US-based users if they answer with a non-US country.
Ask for a US address or phone number. Although this can be faked, it's not easy to get a matching address & ZIP code, for example.
Use GeoIP, and allow sign-ups only from US IP addresses. This is not fool-proof, since people can use your service on holidays and such.
In the question's comments, you said:
Does it not make sense that if someone has a .jp TLD, or .co.uk, it stands to reason (with considerable accuracy) that they are internationally based?
Usually, yes. But far from always. My girlfriend has 4 .uk email addresses, and she doesn't live in the UK anymore :-) This is where you have to make a business choice, you can either:
Turn away potential customers
Take more effort in allowing customers with slightly "strange" email addresses
Your business, your choice ;-)
So, with that preamble, if you must do this, this is how you could do it:
import re

EMAIL_REGEX = re.compile(r'''
    ^            # Anchor to the start of the string
    [^@]+        # Username
    @            # Literal @
    ([^@.]+)     # One domain part
    \.           # Literal dot
    ([^@.]+)     # One domain part (the TLD)
    $            # Anchor to the end of the string
''', re.VERBOSE)
print(EMAIL_REGEX.search('test@example.com'))
print(EMAIL_REGEX.search('test@example.co.uk'))
Of course, this still allows you to register with a .nl address, for example. If you want to allow only a certain set of TLDs, then use:
allowed_tlds = ['com', 'net']  # ... probably more
result = EMAIL_REGEX.search('test@example.com')
if result is None or result.groups()[1] not in allowed_tlds:
    print('Not allowed')
However, if you're going to create a whitelist, then you don't need the regex anymore, since dropping it will allow US people with multi-part domain addresses to sign up (such as @nlm.nih.gov).
I've been working on this assignment for a while now. The regex is not particularly difficult, but I don't quite follow how to get the output they want.
Your program should:
Read the html of a webpage (which has been stored as textfile);
Extract all the domains referred to and list all the full http addresses related to these domains;
Extract all the resource types referred to and list all the full http addresses related to these resource types.
Please solve the task using regular expressions and re functions/methods. I suggest using 'finditer' and 'groups' (there might be other possibilities). Please do not use string functions where re is better suited.
The output is supposed to look like this
www.fairfaxmedia.co.nz
http://www.fairfaxmedia.co.nz
www.essentialmums.co.nz
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
www.nzfishingnews.co.nz
http://www.nzfishingnews.co.nz/
www.nzlifeandleisure.co.nz
http://www.nzlifeandleisure.co.nz/
www.weatherzone.co.nz
http://www.weatherzone.co.nz/
www.azdirect.co.nz
http://www.azdirect.co.nz/
i.stuff.co.nz
http://i.stuff.co.nz/
ico
http://static.stuff.co.nz/781/3251781.ico
zip
http://static2.stuff.co.nz/1392867595/static/jwplayer/skin/Modieus.zip
mp4
http://file2.stuff.co.nz/1394587586/272/9819272.mp4
I really need help with how to filter things out so the output shows up like that.
Create a list of (keyword, url) tuples.
Sort it by keyword.
Group per keyword using itertools.groupby.
For each keyword, print the keyword and then all of its URLs (the URLs printed indented).
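The steps above can be sketched as follows; the (keyword, url) pairs here are hypothetical sample data standing in for the regex results:

```python
from itertools import groupby

# Hypothetical (keyword, url) pairs extracted earlier
pairs = [
    ('www.essentialmums.co.nz', 'http://www.essentialmums.co.nz/'),
    ('ico', 'http://static.stuff.co.nz/781/3251781.ico'),
    ('www.essentialmums.co.nz', 'http://www.essentialmums.co.nz/'),
]

pairs.sort(key=lambda kv: kv[0])          # groupby needs sorted input
for keyword, group in groupby(pairs, key=lambda kv: kv[0]):
    print(keyword)
    for _, url in group:
        print('    ' + url)               # URLs indented under each keyword
```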
I have a large set of URLs. Some are similar to each other, i.e. they represent a similar set of pages.
For example:
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
are similar. Similarly
http://example.com/showitem/apple/
http://example.com/showitem/banana/
http://example.com/showitem/grapes/
are also similar. So I need to represent them as http://example.com/product/(integers)/, where (integers) = 1, 2, 40, 33, and http://example.com/showitem/(strings)/, where (strings) = apple, banana, grapes, and so on.
Is there any built-in function or library in Python to find these similar URLs in a large set of mixed URLs? How can this be done efficiently? Please suggest. Thanks in advance.
Use a string to store the first part of the URL and just handle the IDs, for example:
In [1]: PRODUCT_URL = 'http://example.com/product/%(id)s/'
In [2]: _ids = '1 2 40 33'.split()  # split string into a list of IDs
In [3]: for id in _ids:
   ...:     print(PRODUCT_URL % {'id': id})
   ...:
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
The statement print(PRODUCT_URL % {'id': id}) uses Python string formatting to build the product URL from the variable id.
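The same idea is usually written with str.format (or an f-string) these days; the URL template here is the same hypothetical one as above:

```python
PRODUCT_URL = 'http://example.com/product/{id}/'

# Build one URL per ID using str.format
urls = [PRODUCT_URL.format(id=i) for i in '1 2 40 33'.split()]
print(urls[0])  # http://example.com/product/1/
```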
UPDATE:
I see you've changed your question. The solution for your problem is quite domain-specific and depends on your data set. There are several approaches, some more manual than others. One such approach would be to get the top-level URLs i.e. to retrieve the domain name:
In [7]: _url = 'http://example.com/product/33/' # url we're testing with
In [8]: ('/').join(_url.split('/')[:3]) # get domain
Out[8]: 'http://example.com'
In [9]: ('/').join(_url.split('/')[:4]) # get domain + first URL sub-part
Out[9]: 'http://example.com/product'
[:3] and [:4] above are just slicing the list resulting from split('/')
You can use the result as a key in a dict, in which you keep a count of how often you encounter each URL part, and move on from there. Again, the solution depends on your data. If it gets more complex than the above, then I suggest you look into regex, as the other answers suggest.
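A minimal sketch of that counting idea, using collections.Counter keyed on domain + first URL sub-part (the sample URLs come from the question):

```python
from collections import Counter

urls = [
    'http://example.com/product/1/',
    'http://example.com/product/2/',
    'http://example.com/product/40/',
    'http://example.com/showitem/apple/',
    'http://example.com/showitem/banana/',
]

# Key each URL by domain + first sub-part, then count occurrences
prefix_counts = Counter('/'.join(u.split('/')[:4]) for u in urls)
print(prefix_counts)
```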
You can use regular expressions to handle those cases; the Python re documentation shows how this is done.
You can also look at how Django implements this in its routing system.
I'm not exactly sure what specifically you are looking for. It sounds to me like you want something to match URLs. If that is indeed the case, then I suggest you use something built on regular expressions. One example can be found here.
I also suggest you take a look at Django and its routing system.
Not in Python, but I've created a Ruby library (and an accompanying app):
https://rubygems.org/gems/LinkGrouper
It works on all links (doesn't need to know any pattern).
I'm writing a registration form that only needs to accept the local component of a desired email address. The domain component is fixed to the site. I am attempting to validate it by selectively copying from validators.validate_email, which Django provides for EmailField:
email_re = re.compile(
r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
# quoted-string, see also http://tools.ietf.org/html/rfc2822#section-3.2.5
r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'
r')@((?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$)' # domain
r'|\[(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\]$', re.IGNORECASE) # literal form, ipv4 address (SMTP 4.1.3)
validate_email = EmailValidator(email_re, _(u'Enter a valid e-mail address.'), 'invalid')
Following is my code. My main issue is that I'm unable to adapt the regex. At this point I'm only testing it in a regex tester at http://www.pythonregex.com/, but it's failing:
^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$
This seems to be passing undesirable characters such as ?
The entire code for my field (not necessarily relevant at this stage, but I wouldn't mind some comment on it) is:
class LocalEmailField(CharField):
email_local_re = re.compile(r"^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$", re.IGNORECASE)
validate_email_local = RegexValidator(email_local_re, (u'Enter a valid e-mail username.'), 'invalid')
default_validators = [validate_email_local]
EDIT: To clarify, the user only enters the text BEFORE the @, which is why I don't need to validate the @domain.com part in the validator.
EDIT 2: So the form field and label will look like this:
Desired Email Address: [---type-able area---] @domain.com
You say "undesirable characters such as ?", but I think you're mistaken about what characters are desirable. The original regex allows question marks.
Note that you can also define your own validator that doesn't use a massive regex, and have some chance of decoding the logic later.
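As a sketch of that idea, here is a plain-function validator; the allowed character set and the helper name are assumptions to adapt to your own rules (in Django you would raise ValidationError rather than ValueError):

```python
# Sketch: validate an email local part without one massive regex.
# ALLOWED is an assumed character set; extend it to match your policy.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789!#$%&'*+-/=?^_`{|}~")

def validate_local_part(value):
    value = value.lower()
    if not value:
        raise ValueError('empty local part')
    if value[0] == '.' or value[-1] == '.' or '..' in value:
        raise ValueError('misplaced dot')
    if any(ch not in ALLOWED and ch != '.' for ch in value):
        raise ValueError('invalid character')
    return value
```

Each rule is a readable line you can change later, which is the main advantage over a dense regex.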
Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Checking via regex is an exercise in wasting your time. The best way is to attempt delivery; this way not only can you verify the email address, but also if the mailbox is actually active and can receive emails.
Otherwise you'll end up with an ever-expanding regular expression that can't possibly hope to match all the rules.
"Haha boo hoo woo woo!"@foo.com is a valid address, and so is qwerterukeriouo@gmail.com.
Instead, offer the almost-standard "Please click on the link in the email we sent to blahblah@goo.com to verify your address." approach.
If you want to create email addresses, then you can write your own rules on what can be part of the local component; they can be a subset of the characters officially allowed by the RFC.
For example, a conservative rule (that doesn't use regular expressions):
import string

allowed_chars = set(string.digits + string.ascii_letters + '-')
if any(c not in allowed_chars for c in user_input):
    print('Sorry, invalid characters')
elif user_input[0] in string.digits + '-':
    print('Cannot start with a number or `-`')
elif check_if_already_exists(user_input):
    print('Sorry, already taken')
else:
    print('Congratulations!')
I'm still new to Django and Python, but why reinvent the wheel and maintain your own regex? If, apart from wanting users to enter only the local portion of their email address, you're happy with Django's built-in EmailField, you can subclass it quite easily and tweak the validation logic a bit:
DOMAIN_NAME = u'foo.com'

class LocalEmailField(models.EmailField):
    def clean(self, local_part):
        whole_address = '%s@%s' % (local_part, DOMAIN_NAME)
        clean_address = super(LocalEmailField, self).clean(whole_address)
        # Can do more checking here if necessary
        clean_local, at_sign, clean_domain = clean_address.rpartition('@')
        return clean_local
Have you looked at the documentation for Form and Field Validation and the .clean() method?
If you want to do it 100% correctly with regex, you need an engine with some form of extended regex that allows matching nested parentheses.
Python's default engine does not support this, so you're better off compromising on a very simple (permissive) regex.
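In that spirit, a deliberately permissive sketch (exactly one @, no whitespace, at least one dot in the domain) catches obvious garbage and leaves real verification to a confirmation email:

```python
import re

# Permissive on purpose: one "@", no whitespace, a dot somewhere after the "@"
SIMPLE_EMAIL = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

print(bool(SIMPLE_EMAIL.match('user@example.co.uk')))  # True
print(bool(SIMPLE_EMAIL.match('not-an-email')))        # False
```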
I have a large number of email addresses to validate. Initially I parse them with a regexp to throw out the completely crazy ones. I'm left with the ones that look sensible but still might contain errors.
I want to find which addresses have valid domains, so given me@abcxyz.com I want to know whether it's even possible to send email to abcxyz.com.
I want to test each domain to see if it corresponds to a valid A or MX record. Is there an easy way to do this using only the Python standard library? I'd rather not add an additional dependency to my project just to support this feature.
There is no DNS interface in the standard library so you will either have to roll your own or use a third party library.
This is not a fast-changing concept though, so the external libraries are stable and well tested.
The one I've used successfully for the same task as your question is PyDNS.
A very rough sketch of my code is something like this:
import DNS, smtplib

DNS.DiscoverNameServers()
mx_hosts = DNS.mxlookup(hostname)
# Just doing the mxlookup might be enough for you,
# but do something like this to test for a working SMTP server
for mx in mx_hosts:
    smtp = smtplib.SMTP()
    # If this doesn't raise an exception, it is a valid MX host
    try:
        smtp.connect(mx[1])
    except smtplib.SMTPConnectError:
        continue  # try the next MX server in the list
Another library that might be better/faster than PyDNS is dnsmodule, although it looks like it hasn't had any activity since 2002, compared to PyDNS's last update in August 2008.
Edit: I would also like to point out that email addresses can't be reliably parsed with a regex. You are better off using the parseaddr() function in the standard library's email.utils module (see my answer to this question, for example).
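For example, parseaddr() splits a display name from the address without any custom regex:

```python
from email.utils import parseaddr

# parseaddr returns a (display-name, address) tuple
name, addr = parseaddr('Alice Example <alice@example.com>')
print(name)  # Alice Example
print(addr)  # alice@example.com
```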
The easy way to do this, though NOT with the standard library, is to use the validate_email package:
from validate_email import validate_email
is_valid = validate_email('example@example.com', check_mx=True)
For faster results when processing a large number of email addresses (e.g. a mailing list), you could cache the domains and only do a check_mx when the domain hasn't been seen yet. Something like:
emails = ["email@example.com", "email@bad_domain", "email2@example.com", ...]
verified_domains = set()
for email in emails:
    domain = email.split("@")[-1]
    domain_verified = domain in verified_domains
    is_valid = validate_email(email, check_mx=not domain_verified)
    if is_valid:
        verified_domains.add(domain)
An easy and effective way is to use a Python package named validate_email.
This package provides both facilities. Check this article, which will help you check whether an email address actually exists.