python to extract list match from email thread - python

I'm new to Python and need to retrieve a list of matches from a piece of text.
For example, my text is the email thread below.
I need to extract every To, From, Sent, Subject and body from the thread.
The result should be one list per field, for example:
From(1) = Crandall, Sean
From(2) = Nettelton, Marcus
To(1) = Crandall, Sean; Badeer, Robert
To(2) = Meredith, Kevin
and likewise for Sent, Subject and so on.
"-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."

You can use re.findall() for this, see: https://docs.python.org/2/library/re.html#re.findall. E.g.
re.findall(r"From:\s*(.*\S)", input_string)
would return a list of the From-names (['Crandall, Sean', 'Meredith, Kevin']). The \s* and the trailing \S mean the match does not depend on the exact amount of whitespace around the name.
If you want to get fancy, you could capture several fields in the same expression, e.g.
re.findall(r"From:\s*(.*\S)\s*\nSent:\s*(.*\S)", input_string)
would return [('Crandall, Sean', 'Wednesday, May 23, 2001 2:56 PM'), ('Meredith, Kevin', 'Wednesday, May 23, 2001 11:16 AM')]
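To get one list per field, as the question asks, the same idea can be applied once per header name. This is only a minimal sketch: it assumes every header starts its own line, and the names `thread` and `extract_headers` are illustrative rather than taken from the post.

```python
import re

thread = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""


def extract_headers(text, fields=("From", "Sent", "To", "Subject")):
    """Return {field: [value, ...]} in the order the messages appear."""
    result = {}
    for field in fields:
        # With re.MULTILINE, ^ anchors the field name at the start of a line;
        # [ \t]* and (.*\S) drop the whitespace around the value.
        pattern = r"^{}:[ \t]*(.*\S)".format(re.escape(field))
        result[field] = re.findall(pattern, text, flags=re.MULTILINE)
    return result


headers = extract_headers(thread)
print(headers["From"])  # ['Crandall, Sean', 'Meredith, Kevin']
print(headers["To"])    # ['Meredith, Kevin', 'Crandall, Sean; Badeer, Robert']
```

Anchoring at the start of the line keeps a body line that merely contains "To:" somewhere in the middle from being picked up as a header.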

If you don't know regex yet, and since your problem is not that hard, you could consider using the split() and replace() functions instead.
Here are some lines of code that might be a good start:
mails = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""
mails_list = mails.split("-----Original Message-----\n")
mails_from = []
mails_sent = []
mails_to = []
mails_subject = []
mails_body = []
for mail in mails_list:
    if not mail:
        continue
    inter = mail.split("From: ")[1].split("\nSent: ")
    mails_from.append(inter[0])
    inter = inter[1].split("\nTo: ")
    mails_sent.append(inter[0])
    inter = inter[1].split("\nSubject: ")
    mails_to.append(inter[0])
    inter = inter[1].split("\n")
    mails_subject.append(inter[0])
    # Everything after the subject line is the body
    mails_body.append("\n".join(inter[1:]).strip())
Note that this uses only very basic concepts.
Here are some points that you might need to consider:
Try it yourself; you might need some adjustments.
This kind of parsing is brittle: the format of the mails must match exactly.
There might be some whitespace that you want to remove, for example with the strip() or replace() methods.
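A slightly more tolerant variant of the same split-based idea is sketched below. It assumes the headers always come before the body of each message; `SEPARATOR`, `FIELDS` and `parse_thread` are illustrative names, not part of the original answer.

```python
SEPARATOR = "-----Original Message-----"
FIELDS = ("From", "Sent", "To", "Subject")

sample = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""


def parse_thread(text):
    """Split a forwarded thread into one dict per message."""
    messages = []
    for chunk in text.split(SEPARATOR):
        lines = [line.strip() for line in chunk.strip().splitlines()]
        if not lines:
            continue  # nothing before the first separator
        message = {}
        body = []
        for line in lines:
            name, sep, value = line.partition(":")
            # A line counts as a header only the first time its field appears
            if sep and name in FIELDS and name not in message:
                message[name] = value.strip()
            else:
                body.append(line)
        message["Body"] = "\n".join(body)
        messages.append(message)
    return messages


messages = parse_thread(sample)
print([m["From"] for m in messages])  # ['Crandall, Sean', 'Meredith, Kevin']
```

Keeping one dict per message means the From, To, Sent, Subject and body of the same mail stay together, which is harder to guarantee with five parallel lists.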

Related

How to extract a human name in a string python

I have a string from an OCR'ed image, and I need to find a way to extract human names from it. The OCR output is:
From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri
There are 4 names in total in the heading, and I need to get this output:
names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' # desired output
but I have no idea how to extract the names. I have tried RegEx and came up with:
names = re.findall(r'(?i)([A-Z][a-z]+[A-Z][a-z][, ] [A-Z][a-z]+)', string) #regex to find names
which searches for a capitalized word, then a comma, then another capitalized word. It is close to the desired result, but it comes out as:
names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] # actual result
I have thought of using another pattern that extracts the room names and excludes them from the list, but I have no idea how to implement that. I am new to RegEx, so any help will be appreciated. Thanks in advance.
Notwithstanding the excellent RE approach suggested by @JvdV, here's a step-by-step way in which you could achieve this:
OCR = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri"""
names = []
for line in OCR.split('\n'):
    tokens = line.split()
    if tokens and tokens[0] in ['From:', 'To:', 'Ce:']: # Ce or Cc ???
        parts = line.split(';')
        for i, p in enumerate(parts):
            # i==0 skips the header token on the first part; -1 drops the address
            names.append(' '.join(p.split()[i==0:-1]))
print(names)
Depending on the contents of your email, a reasonable approach might be to use:
[:;]\s*(.+?)\s*<
[:;] - A (semi-)colon;
\s* - 0+ (Greedy) whitespaces;
(.+?) - A 1st capture group of 1+ (Lazy) characters;
\s* - 0+ (Greedy) whitespaces;
< - A literal '<'.
Note that I specifically use (.+?) to capture names since names are notoriously hard to match.
import re
s = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri"""
print(re.findall(r'[:;]\s*(.+?)\s*<', s))
Prints:
['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']
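If the body of the mail could itself contain a colon followed later by a '<', one option is to apply the same pattern only to header lines. The sketch below assumes the headers are From, To and Cc (with "Ce" accepted as the OCR misreading of "Cc"); `HEADER_LINE`, `NAME` and `header_names` are illustrative names.

```python
import re

# 'C[ce]' tolerates the OCR misreading of 'Cc' as 'Ce'
HEADER_LINE = re.compile(r"^(?:From|To|C[ce]):", re.IGNORECASE)

NAME = re.compile(r"""
    [:;] \s*      # the header colon, or a semicolon between recipients
    (.+?)         # the name, matched lazily
    \s* <         # up to the '<' that opens the address
""", re.VERBOSE)

sample = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Room: Luban, available on 26/09/2021. Rate: $4540"""


def header_names(text):
    """Collect names from From/To/Cc lines only."""
    names = []
    for line in text.splitlines():
        if HEADER_LINE.match(line):
            names.extend(NAME.findall(line))
    return names


print(header_names(sample))
```

The verbose form of the pattern is the same expression as above, just spread out with comments.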

Regex to Remove Name, Address, Designation from an email text Python

I have sample email text like the one below. I want to keep only the body of each message and remove names, addresses, designations, company names and email addresses. To be clear, I only want the content of each mail between the greeting (Dear/Hi/Hello) and the sign-off (Sincerely/Regards/Thanks). How can I do this efficiently using a regex or some other way?
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
john.doe#pqr.com
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Warm Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Center for Research
Office of New Discoveries
Food and Drug Administration
Loren#mno.com
From this text I only want as OUTPUT :
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Yes, an extension until June 22, 2018 is acceptable.
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
responding by June 15, 2018.check email for updates
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Below is an answer that works for your current input. The code will have to be adjusted when you process examples that fall outside the parameters outlined in the code below.
import re

with open('email_input.txt') as input_file:
    # List to store the cleaned lines
    clean_lines = []
    # Reads until EOF
    lines = input_file.readlines()
    # Remove surrounding whitespace and newlines
    no_new_lines = [i.strip() for i in lines]
    # Convert the input to all lowercase
    lowercase_lines = [i.lower() for i in no_new_lines]
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in lowercase_lines:
        # Look for lines that start with a subject line
        if line.startswith('subject: [external]'):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a sign-off
        elif line.startswith("regards,") or line.startswith("warm regards,") \
                or line.startswith("best regards,") or line.startswith("hello,"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # regex to catch greeting lines
            greeting_component = re.compile(r'(dear.*,|(hi.*,))', re.IGNORECASE)
            remove_greeting = re.match(greeting_component, line)
            if not remove_greeting:
                if line not in clean_lines:
                    clean_lines.append(line)
for item in clean_lines:
    print(item)
# output
subject: [external] re: query regarding supplement 73
yes, an extension until june 22, 2018 is acceptable.
we had initial discussion with the abc team us know if you would be able to
extend the response due date to june 22, 2018.
responding by june 15, 2018.check email for updates
please refer to your january 12, 2018 data containing labeling supplements
to add text regarding this symptom. we are currently reviewing your
supplements and have made additional edits to your label.
feel free to contact me with any questions.
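The answer above lowercases everything before filtering. If the original capitalisation matters, the same state-machine idea can work on a string and compare lowercase copies only. This is a sketch under the same assumptions as the answer (including treating "Hello," as a sign-off, which only fits this particular sample); `CLOSINGS`, `GREETING` and `extract_bodies` are illustrative names.

```python
import re

CLOSINGS = ("regards,", "warm regards,", "best regards,", "hello,")
# \b keeps words that merely start with "hi" (e.g. "history") from matching
GREETING = re.compile(r"(dear|hi)\b.*,", re.IGNORECASE)

sample = """Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director"""


def extract_bodies(text):
    """Keep subject and body lines, preserving their original case."""
    kept = []
    keeping = False
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("subject:"):
            keeping = True       # a new message starts
        elif lowered.startswith(CLOSINGS):
            keeping = False      # sign-off reached, skip the signature
        if keeping and line and not GREETING.match(line) and line not in kept:
            kept.append(line)
    return kept


for line in extract_bodies(sample):
    print(line)
```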

Extract words/sentence that occurs before a keyword from a string - Python

I have a string like this,
my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
Now, I want to extract the current champion and the underdog using keywords champion and underdog .
What is really challenging here is both contender's names appear before the keyword inside parenthesis. I want to use regular expression and extract information.
Following is what I did,
champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)
>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']
underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)
>>['"underdog").']
However, I need the results, champion as:
brooklyn centenniel, resident of detroit, michigan
and the underdog as:
kamil kubaru, the challenger from alexandria, virginia
How can I do this using a regular expression? (I have been looking for a way to go back a couple of words from the keyword to get the result I want, but no luck yet.) Any help or suggestion would be appreciated.
You can use named capture groups to capture the desired results:
between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between to ("champion") and captures the portion in between as the named group champion.
After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and captures the desired portion as the named group underdog.
Example:
In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)
In [28]: out.groupdict()
Out[28]:
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
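If the labels are not known in advance, a variation is to capture the label itself and build a dictionary from the pairs. This sketch assumes every labelled party is introduced by "between" or "and" and followed by a label in the form ("label"); `LABELLED` and `roles` are illustrative names.

```python
import re

my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") and '
          'kamil kubaru, the challenger from alexandria, virginia ("underdog").')

# Group 1: the text before the label; group 2: the label inside ("...")
LABELLED = re.compile(r'\b(?:between|and)\s+(.*?)\s+\("(\w+)"\)')

roles = {label: text for text, label in LABELLED.findall(my_str)}
print(roles["champion"])  # brooklyn centenniel, resident of detroit, michigan
print(roles["underdog"])  # kamil kubaru, the challenger from alexandria, virginia
```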
There will be a better answer than this, and I don't know regex at all, but I'm bored, so here's my 2 cents.
Here's how I would go about it:
words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)
For the underdog, you will have to change the 6 to a 7, and '("champion")' to '("underdog").'
Not sure if this will solve your problem, but for this particular string, this worked when I tested it.
You could also use str.strip() to remove punctuation if that trailing period on underdog is a problem.

Python extract both names *AND* emails from body using regex in a single swoop

Python3
I need help creating a regex to extract names and emails from a forwarded email body, which will look similar to this always (real emails replaced by dummy emails):
> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa#aa-aaa.com>
> To: maria.brown#aaa.com, George Washington <george#washington.com>, =
thomas.jefferson#aaa.com, thomas.alva.edison#aaa.com, Juan =
<juan#aaa.com>, Alan <alan#aaa.com>, Alec <alec#aaa.com>, =
Alejandro <aaa#aaa.com>, Alex <aaa#planeas.com>, Andrea =
<andrea.mery#thomsen.cl>, Andrea <andrea.22#aaa.com>, Andres =
<andres#aaa.com>, Andres <avaldivieso#aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye
My first step was extracting all emails to a list with a custom function that I pass the whole email body to, like so:
def extract_emails(block_of_text):
    t = r'\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)
A couple of days ago I asked a question about extracting names using regex to help me build the function to extract all the names. My idea was to join both later on. I accepted an answer that performed what I asked, and came up with this other function:
def extract_names(block_of_text):
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)
My problem now is matching the extracted names to the extracted emails, mainly because there are sometimes fewer names than emails. So I thought it would be better to build a single regex that extracts both names and emails.
This is my failed attempt to build such a regex:
[:,]([\w \<]+)([\w.-]+#[\w.-]+\.[\w.-]+)
Can anyone help and propose a nice, clean regex that grabs both name and email, to a list or dictionary of tuples? Thanks
EDIT:
The expected output of the regex in Python would be a list like this:
[('Charlie Brown', 'aaa#aa-aaa.com'), ('', 'maria.brown#aaa.com'), ('George Washington', 'george#washington.com'), ('', 'thomas.jefferson#aaa.com'), ('', 'thomas.alva.edison#aaa.com'), ('Juan', 'juan#aaa.com'), ('Alan', 'alan#aaa.com'), ('Alec', 'alec#aaa.com'), ('Alejandro', 'aaa#aaa.com'), ('Alex', 'aaa#planeas.com'), ('Andrea', 'andrea.mery#thomsen.cl'), ('Andrea', 'andrea.22#aaa.com'), ('Andres', 'andres#aaa.com'), ('Andres', 'avaldivieso#aaa.com')]
It seems like you want something like this:
[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+#[\w.-]+)
>>> import re
>>> s = """ > Begin forwarded message:
>=20
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa#aa-aaa.com>
> To: maria.brown#aaa.com, George Washington <george#washington.com>, =
thomas.jefferson#aaa.com, thomas.alva.edison#aaa.com, Juan =
<juan#aaa.com>, Alan <alan#aaa.com>, Alec <alec#aaa.com>, =
Alejandro <aaa#aaa.com>, Alex <aaa#planeas.com>, Andrea =
<andrea.mery#thomsen.cl>, Andrea <andrea.22#aaa.com>, Andres =
<andres#aaa.com>, Andres <avaldivieso#aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> re.findall(r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+#[\w.-]+)', s)
[('Charlie Brown', 'aaa#aa-aaa.com'), ('', 'maria.brown#aaa.com'), ('George Washington', 'george#washington.com'), ('', 'thomas.jefferson#aaa.com'), ('', 'thomas.alva.edison#aaa.com'), ('Juan', 'juan#aaa.com'), ('Alan', 'alan#aaa.com'), ('Alec', 'alec#aaa.com'), ('Alejandro', 'aaa#aaa.com'), ('Alex', 'aaa#planeas.com'), ('Andrea', 'andrea.mery#thomsen.cl'), ('Andrea', 'andrea.22#aaa.com'), ('Andres', 'andres#aaa.com'), ('Andres', 'avaldivieso#aaa.com')]
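An alternative that avoids one large pattern is to undo the soft line breaks first (the "=" at the end of a line), split the recipient list on commas, and parse each entry separately. The sketch below makes several assumptions: names never contain a comma, the value of the To header has already been isolated, and "#" stands in for "@" as in the dummy data above. `ENTRY` and `parse_recipients` are illustrative names.

```python
import re

# Optional "Name <" prefix, then the address, then an optional ">"
ENTRY = re.compile(r'^(?:(?P<name>[^<]*?)\s*<)?(?P<email>[\w.-]+#[\w.-]+)>?$')

to_value = ("maria.brown#aaa.com, George Washington <george#washington.com>, =\n"
            "thomas.jefferson#aaa.com, Juan =\n"
            "<juan#aaa.com>, Alan <alan#aaa.com>")


def parse_recipients(header_value):
    """Return (name, email) tuples; name is '' when only an address is given."""
    # Remove soft line breaks: '=' at the end of a line plus the newline
    unfolded = re.sub(r'=\s*\n\s*', '', header_value)
    pairs = []
    for entry in unfolded.split(','):
        match = ENTRY.match(entry.strip())
        if match:
            pairs.append((match.group('name') or '', match.group('email')))
    return pairs


print(parse_recipients(to_value))
```

Because each entry is parsed on its own, a missing name simply yields an empty string instead of shifting the pairing between names and addresses.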

parsing string - regex help in python

Hi, I have this string in Python:
'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'
I need to extract the following:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I tried to do this by using:
val = desc.split("\r\n")
and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.
Any help will be highly appreciated.
If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL:
import re
desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)
if desc_match:
    for gname in ['loc', 'time', 'vends']:
        # strip() drops any stray \r or \n captured at the edges of a group
        print(desc_match.group(gname).strip())
Given your definition of desc, this prints out:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck.) And again, this is only "nicer" if it works more often than your solution using str.split() - that is, if there are any possible input strings for which your solution does not produce the correct result.
If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...))
r'''(?sx)
    Location: \s* (?P<loc>.+?) [\n\r]
    Time: \s* (?P<time>.+?) [\n\r]
    Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''
Alternatively, split the text on blank lines (here s is the string from the question):
NLNL = "\r\n\r\n"
parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)
which gives
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
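If the goal is structured data rather than a block of text, a line-by-line pass can fill a small dictionary. This is a sketch that assumes the "Location:", "Time:" and "Vendors:" labels appear exactly as in the sample and that a blank line ends the vendor list; `parse_listing` is an illustrative name and `desc` is an abridged copy of the string from the question.

```python
desc = ('Check out live music every Friday.\r\n\r\n'
        'Location: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\n'
        'Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n'
        'The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n'
        'CATERING NEEDS? Have OtG cater your next event!')


def parse_listing(text):
    """Return the location, time and vendor list found in the description."""
    info = {"Location": None, "Time": None, "Vendors": []}
    in_vendors = False
    # splitlines() treats \r\n as a single line break
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Location:"):
            info["Location"] = line[len("Location:"):].strip()
        elif line.startswith("Time:"):
            info["Time"] = line[len("Time:"):].strip()
        elif line == "Vendors:":
            in_vendors = True
        elif in_vendors:
            if not line:          # a blank line ends the vendor list
                in_vendors = False
            else:
                info["Vendors"].append(line)
    return info


info = parse_listing(desc)
print(info["Location"])
print(info["Vendors"])
```

Unlike indexing into the result of split(), this does not depend on how many lines precede the Location line.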
