count distinct gmail addresses - python

I need to count the distinct Gmail addresses provided by the user as input. Here are the conditions:
Addresses are not case sensitive:
"a#gmail.com" == "A#GmaiL.com"
The '.' character in the local name is ignored:
"aa#gmail.com" == "a.a#gmail.com"
The gmail.com domain is the same as googlemail.com:
"aa#gmail.com" == "aa#googlemail.com"
My issue is with the very last one. How do I implement the last condition in my code?
distinct_emails=[]
email = []
count=0
for i in range(int(input())):
    item = input().lower().replace(".","")
    email.append(item)
for i in email:
    if i not in distinct_emails:
        count = count + 1
        distinct_emails.append(i)
print(count)

You could try something like this, where for gmail and googlemail addresses, you check for the swapped versions before appending it to the distinct_emails list.
distinct_emails=[]
email = []
count=0
for i in range(int(input())):
    # read one address per line and lower-case it
    item = input().lower()
    # don't remove `.` after the `#`.
    parts = item.split("#")
    email.append(parts[0].replace(".", "") + "#" + parts[1])
for i in email:
    # consider googlemail and gmail to be equivalent
    if not any(e in distinct_emails for e in [i, i.replace('#googlemail.com', '#gmail.com'), i.replace('#gmail.com', '#googlemail.com')]):
        count = count + 1
        distinct_emails.append(i)
print(count)
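An alternative to checking the swapped versions is to map every address to one canonical form first and let a set do the de-duplication. The following is only a sketch under a few assumptions: the helper names `canonical` and `count_distinct` are made up for illustration, every input is assumed to contain the separator, and the separator is '#' as written in this thread.

```python
SEP = "#"  # assumption: addresses are written with '#' as the separator, as in this thread

def canonical(address):
    """Map an address to one canonical form so equivalent addresses compare equal."""
    local, _, domain = address.strip().lower().rpartition(SEP)
    local = local.replace(".", "")      # dots in the local name are ignored
    if domain == "googlemail.com":      # googlemail.com is treated as gmail.com
        domain = "gmail.com"
    return local + SEP + domain

def count_distinct(addresses):
    """Count addresses that differ after canonicalization."""
    return len({canonical(a) for a in addresses})

print(count_distinct(["a.a#gmail.com", "AA#GoogleMail.com", "b#gmail.com"]))  # 2
```

Because membership in a set is checked by hash, this avoids scanning a list for every address, which matters once the input gets large.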

Related

Extract Text from a word document

I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried to use the steps mentioned in:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above, but I am still facing issues with it.
There are some inconsistencies in the document: phone numbers start with Tel: in some places, Tel.: in others, and even Te: once; one of the emails is just the last line for that distributor, without the Email: prefix; and the State isn't always in the last line. Still, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
from bs4 import BeautifulSoup
def getParaColor(para):
    # return the font color (hex string) found in the paragraph's xml, or '' if none
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except Exception:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw'] # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [t.strip() for t in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
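To make the '/' handling concrete, here is a small standalone sketch of the same slicing idea. The function name `expand_slash_numbers` and the phone numbers are made up for illustration; they are not taken from the document.

```python
def expand_slash_numbers(raw):
    """Expand 'base/suffix' shorthand into full numbers."""
    parts = [p.strip() for p in raw.split('/')]
    first = parts[0]
    numbers = [first]
    for suffix in parts[1:]:
        # keep the leading characters of the first number and swap in the new ending
        numbers.append(first[:-len(suffix)] + suffix)
    return numbers

print(expand_slash_numbers("2345678/79"))         # ['2345678', '2345679']
print(expand_slash_numbers("2345678 / 2345690"))  # ['2345678', '2345690']
```

This also shows why the later numbers are less reliable than the first: if the part after '/' is an unrelated number of a different length, whatever is left over from the front of the first number gets glued onto it.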
After this, you can just view as DataFrame with:
import docx
import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Extract emails from the entered values in python

My objective is: I need to ask the user for the number of emails and then run a for loop to read each email. Then the emails should be segregated based on '#professor.com' and '#student.com', counted, and appended to a list. The following is what I have tried:
email_count = int(input('how many emails you want'))
student_email_count = 0
professor_email_count = 0
student_email = '#student.com'
professor_email = '#professor.com'
for i in range(email_count):
    email_list = str(input('Enter the emails'))
    if email_list.index(student_email):
        student_email_count = student_email_count + 1
    elif email_list.index(professor_email):
        professor_email_count = professor_email_count + 1
Can someone help me shorten this and write it professionally, with explanations for further reference? The appending part is missing here. Could someone throw some light on that too?
Thanks
prof_email_count, student_email_count = 0, 0
for i in range(int(input("Email count # "))):
    email = input("Email %s # " % (i+1))
    if email.endswith("#student.com"): # str.endswith(s) checks whether `str` ends with s, returns boolean
        student_email_count += 1
    elif email.endswith("#professor.com"):
        prof_email_count += 1
This is what a (somewhat) shortened rendition of your code would look like. The main differences are that I'm using str.endswith(...) instead of str.index(...), and that I've removed the email_count, student_email and professor_email variables, which didn't seem to be used anywhere else in the context.
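As a quick illustration of why str.index(...) is a poor fit for an if test, the snippet below compares it with str.find(...) and str.endswith(...). The address is made up, and '#' is the separator as written in this thread.

```python
email = "alice#student.com"   # hypothetical address for illustration

print(email.index("#student.com"))      # 5: the position of the match, not True/False
print(email.find("#professor.com"))     # -1: find() reports "not found" without raising
print(email.endswith("#student.com"))   # True

try:
    email.index("#professor.com")
except ValueError as exc:
    print("index() raised:", exc)       # index() raised: substring not found
```

Note that index() would return 0 for a match at the very start of the string, which an if statement treats as false, so even a successful match can be skipped.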
EDIT:
To answer your comment on scalability, you could implement a system such as this:
domains = {
    "student.com": 0,
    "professor.com": 0,
    "assistant.com": 0
}
for i in range(int(input("Email count # "))):
    email = input("Email %s # " % (i+1))
    try:
        domain = email.split('#')[1]
    except IndexError:
        print("No/invalid domain passed")
        continue
    if domain not in domains:
        print("Domain: %s, invalid." % domain)
        continue
    domains[domain] += 1
This scales further, because you can add more domains to the domains dictionary and read each count with domains[<domain>].
It seems your iteration accepts one email at a time; and executes email_count times. You can use this simple code to count students and professors:
st = '#student.com'
prof = '#professor.com'
for i in range(email_count):
    email = str(input('Enter the email'))
    if st in email:
        student_email_count += 1
    elif prof in email:
        professor_email_count += 1
    else:
        print('invalid email domain')
If you are using Python 2.7, you should change input to raw_input.
Here's a scalable version for your code, using defaultdict to support unlimited domains.
from collections import defaultdict
email_count = int(input('how many emails you want'))
# missing keys start at 0, so any domain can be counted without declaring it first
domains = defaultdict(int)
for i in range(email_count):
    email = str(input('Enter the email\n'))
    try:
        email_part = email.split('#')[1]
    except IndexError:
        print('invalid email syntax')
    else:
        domains[email_part] += 1
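The question also asked about appending the emails to lists, which the counters above do not cover. One possible sketch is to group the addresses by domain with defaultdict(list); the helper name `group_by_domain` and the sample addresses are hypothetical, and '#' is the separator as written in this thread.

```python
from collections import defaultdict

SEP = "#"  # assumption: separator as written in this thread

def group_by_domain(emails):
    """Return {domain: [emails]}; malformed entries are collected under the key None."""
    groups = defaultdict(list)
    for email in emails:
        local, sep, domain = email.partition(SEP)
        # valid only if there is exactly one separator with text on both sides
        valid = sep and local and domain and SEP not in domain
        groups[domain if valid else None].append(email)
    return dict(groups)

groups = group_by_domain(["a#student.com", "b#professor.com", "c#student.com", "oops"])
print(groups["student.com"])         # ['a#student.com', 'c#student.com']
print(len(groups["professor.com"]))  # 1
```

The counts fall out of the same structure, since len(groups[domain]) gives the number of emails per domain.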

Timeout error in "simple" code

I have the following statement:
There are two types of commands:
store email_content urgency, where email_content is a string consisting of [a-zA-Z0-9_] and urgency is a positive integer
get_next_email - a request that expects the email_content with the highest urgency as a response. If more than one email has the same urgency, output the one that was stored first. If there are no outstanding emails, print the string "-1" on a new line.
Input example:
4
store email1 1
store email2 10
store email3 10
get_next_email
Output:
email2
My code:
emails = {}
urgencies = set()
for _ in range(int(input())):
    info = input().strip().split(" ")
    if info[0] == "store":
        info[2] = int(info[2])
        urgencies.add(info[2])
        if info[2] in emails:
            emails[info[2]].append(info[1])
        else:
            emails[info[2]] = [info[1]]
    elif info[0] == "get_next_email":
        if emails:
            maxval = max(urgencies)
            print(emails[maxval].pop(0))
            if emails[maxval] == []:
                emails.pop(maxval, None)
                urgencies.remove(maxval)
        else:
            print("-1")
I get a runtime error which, according to what I've read, is a consequence of the complexity being too high, and I don't think my code is that complex. What am I doing wrong, and what can I do to improve this? Thanks!
EDIT: It's a contest, the exact error is: Terminated due to timeout (10s)
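One likely source of the timeout is that every get_next_email call runs max() over all distinct urgencies and list.pop(0) shifts the remaining items, so each request can take time proportional to the amount of stored data. A common alternative is a priority queue. The sketch below uses the standard heapq module; the function name `process_commands` is made up, and it takes the command lines as a list rather than reading stdin so that it is easy to try out.

```python
import heapq

def process_commands(lines):
    """Return the list of outputs produced by the get_next_email commands."""
    heap = []   # entries are (-urgency, arrival_index, email_content)
    out = []
    for arrival, line in enumerate(lines):
        parts = line.split()
        if parts[0] == "store":
            # negating the urgency turns the min-heap into a max-heap;
            # arrival_index breaks ties in favor of the email stored first
            heapq.heappush(heap, (-int(parts[2]), arrival, parts[1]))
        elif parts[0] == "get_next_email":
            out.append(heapq.heappop(heap)[2] if heap else "-1")
    return out

commands = ["store email1 1", "store email2 10", "store email3 10", "get_next_email"]
print(process_commands(commands))  # ['email2']
```

Both heappush and heappop take logarithmic time in the number of stored emails. For large inputs it may also help to read lines from sys.stdin instead of calling input() repeatedly, and to print all outputs at once.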

Bad search filter

I'm trying to filter a few attributes from the LDAP server but I get this error:
ldap.FILTER_ERROR: {'desc': 'Bad search filter'}
Code:-
import ldap
ldap.OPT_REFERRALS = 0
ldap_server="ldapps.test.com"
username = "testuser"
password= "" #your password
connect = ldap.open(ldap_server)
dn='uid='+username;
print 'dn =', dn
try:
    result = connect.simple_bind_s(username,password)
    print 'connected == ', result
    filter1 = "(|(uid=" + username + "\*))"
    result = connect.search("DC=cable,DC=com,DC=com",ldap.SCOPE_SUBTREE,filter1)
    print result
except ldap.INVALID_CREDENTIALS as e:
    connect.unbind_s()
    print "authentication error == ", e
Your search filter is, in fact, bad.
The | character is for joining several conditions together in an OR statement. For example, if you wanted to find people with a last name of "smith", "jones", or "baker", you would use this filter:
(|(lastname=smith)(lastname=jones)(lastname=baker))
However, your filter only has one condition, so there's nothing for the | character to join together. Change your filter to this and it should work:
"(uid=" + username + "\*)"
By the way, what are you trying to do with the backslash and asterisk? Are you looking for people whose usernames actually end with an asterisk?
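To illustrate the filter syntax discussed here, the following sketch builds filter strings with plain string formatting, without any LDAP library. The helper names `escape_value`, `prefix_filter` and `any_of` are hypothetical; the escape sequences are the ones RFC 4515 defines for characters that are reserved inside a filter value.

```python
def escape_value(value):
    """Escape the characters that RFC 4515 reserves inside a filter value."""
    # the backslash must be handled first so later escapes are not escaped again
    for ch, esc in (("\\", r"\5c"), ("*", r"\2a"), ("(", r"\28"), (")", r"\29"), ("\0", r"\00")):
        value = value.replace(ch, esc)
    return value

def prefix_filter(attr, prefix):
    """Match entries whose attribute starts with the given text."""
    return "(%s=%s*)" % (attr, escape_value(prefix))

def any_of(*filters):
    """Join complete filters into an OR filter."""
    return "(|%s)" % "".join(filters)

print(prefix_filter("uid", "testuser"))                # (uid=testuser*)
print(any_of("(lastname=smith)", "(lastname=jones)"))  # (|(lastname=smith)(lastname=jones))
print(escape_value("a*b"))                             # a\2ab
```

An unescaped trailing * acts as a wildcard, while \2a stands for a literal asterisk, so it is worth deciding which of the two is meant before building the filter.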

How do I catch "split" exceptions in python?

I am trying to parse a list of email addresses to remove the username and '#' symbol only leaving the domain name.
Example: blahblah#gmail.com
Desired output: gmail.com
I have accomplished this with the following code:
for row in cr:
    emailaddy = row[0]
    (emailuser, domain) = row[0].split('#')
    print domain
My issue is when I encounter an improperly formatted email address. For example, if the row contains "aaaaaaaaa" (instead of a valid email address), the program crashes with the error
(emailuser, domain) = row[0].split('#')
ValueError: need more than 1 value to unpack.
(as you would expect). Rather than check all the email addresses for validity, I would rather just skip grabbing the domain and move on to the next record. How can I properly handle this error and just move on?
So for the list of:
blahblah#gmail.com
mmymymy#hotmail.com
youououou
nonononon#yahoo.com
I would like the output to be:
gmail.com
hotmail.com
yahoo.com
Thanks!
You want something like this?
try:
    (emailuser, domain) = row[0].split('#')
except ValueError:
    continue
You can just filter out the address which does not contain #.
>>> [mail.split('#')[1] for mail in mylist if '#' in mail]
['gmail.com', 'hotmail.com', 'yahoo.com']
>>>
What about
splitaddr = row[0].split('#')
if len(splitaddr) == 2:
    domain = splitaddr[1]
else:
    domain = ''
This even handles cases like aaa#bbb#ccc and makes it invalid ('').
Try this
In [28]: b = ['blahblah#gmail.com',
              'mmymymy#hotmail.com',
              'youououou',
              'nonononon#yahoo.com']
In [29]: [x.split('#')[1] for x in b if '#' in x]
Out[29]: ['gmail.com', 'hotmail.com', 'yahoo.com']
This does what you want:
import re
l=["blahblah#gmail.com","mmymymy#hotmail.com",
   "youououou","nonononon#yahoo.com","amy#bong#youso.com"]
for e in l:
    if '#' in e:
        l2=e.split('#')
        print l2[-1]
    else:
        print
Output:
gmail.com
hotmail.com
yahoo.com
youso.com
It handles the case where an email might have more than one '#' and just takes the part to the right of the last one.
if '#' in row[0]:
    user, domain = row[0].split('#')
    print domain
Maybe the best solution is to avoid exception handling altogether.
You can do this with the built-in string method partition(). It is similar to split(), but it always returns a 3-tuple (head, separator, tail), so unpacking its result never raises ValueError, even when the separator is not found. Read more: https://docs.python.org/3/library/stdtypes.html#str.partition
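A minimal sketch of the partition() approach, using the sample list from the question; the variable name `rows` is a stand-in for the `cr` reader in the original code, and '#' is the separator as written in this thread.

```python
rows = [["blahblah#gmail.com"], ["mmymymy#hotmail.com"], ["youououou"], ["nonononon#yahoo.com"]]

domains = []
for row in rows:
    emailuser, sep, domain = row[0].partition('#')
    if not sep:
        # no separator found: skip this record and move on
        continue
    domains.append(domain)

print(domains)  # ['gmail.com', 'hotmail.com', 'yahoo.com']
```

For an input such as aaa#bbb#ccc, partition() splits at the first separator, so domain would be 'bbb#ccc'; rpartition() splits at the last one instead.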
We can treat a string without the '#' symbol as a plain username:
try:
    (emailuser, domain) = row[0].split('#')
    print "Email User : " + emailuser
    print "Email Domain : " + domain
except ValueError:
    emailuser = row[0]
    print "Email User Only : " + emailuser
Output:
Email User : abc
Email Domain : gmail.com
Email User : xyz
Email Domain : gmail.com
Email User Only : usernameonly
