Regex to Remove Name, Address, Designation from an email text Python - python

I have sample email text like the one below. I want to keep only the body of each message and remove names, addresses, designations, company names, and email addresses. To be clear, I only want the content of each mail between the greeting (Dear/Hi/Hello) and the sign-off (Sincerely/Regards/Thanks). How can I do this efficiently with a regex or some other way?
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
john.doe#pqr.com
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Warm Regards,
Mr. Roger
Global Director
roger#abc.com
78 Ford st.
Center for Research
Office of New Discoveries
Food and Drug Administration
Loren#mno.com
From this text I only want as OUTPUT :
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Yes, an extension until June 22, 2018 is acceptable.
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
responding by June 15, 2018.check email for updates
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.

Below is an answer that works for your current input. The code will have to be adjusted when you process examples that fall outside the parameters outlined in the code below.
import re

with open('email_input.txt') as input_file:
    # List to store the cleaned lines
    clean_lines = []
    # Reads until EOF
    lines = input_file.readlines()
    # Strip surrounding whitespace, including the trailing newlines
    no_new_lines = [i.strip() for i in lines]
    # Convert the input to all lowercase
    lowercase_lines = [i.lower() for i in no_new_lines]
    # Boolean state variable to keep track of whether we want to keep lines or not
    lines_to_keep = False
    for line in lowercase_lines:
        # Look for lines that start with a subject line
        if line.startswith('subject: [external]'):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a sign-off (or the bare "hello,"
        # that precedes a signature in this sample)
        elif line.startswith("regards,") or line.startswith("warm regards,") \
                or line.startswith("best regards,") or line.startswith("hello,"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # regex to catch greeting lines
            greeting_component = re.compile(r'(dear.*,|(hi.*,))', re.IGNORECASE)
            remove_greeting = re.match(greeting_component, line)
            if not remove_greeting:
                # skip lines that were already captured (e.g. the repeated subject)
                if line not in clean_lines:
                    clean_lines.append(line)

for item in clean_lines:
    print(item)
# output
subject: [external] re: query regarding supplement 73
yes, an extension until june 22, 2018 is acceptable.
we had initial discussion with the abc team us know if you would be able to
extend the response due date to june 22, 2018.
responding by june 15, 2018.check email for updates
please refer to your january 12, 2018 data containing labeling supplements
to add text regarding this symptom. we are currently reviewing your
supplements and have made additional edits to your label.
feel free to contact me with any questions.
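If the original capitalization needs to survive, as in the desired output in the question, the same start/stop idea can be sketched without lowercasing the text. This is only a heuristic sketch: the names extract_bodies, SUBJECT, GREETING and SIGN_OFF are made up for illustration, the list of sign-off words is an assumption, and a greeting that shows up after body text (like the bare "Hello," in the third mail) is assumed to mark the start of a signature.

```python
import re

# Hypothetical patterns; extend them for other greetings and sign-offs
SUBJECT = re.compile(r"^subject:", re.IGNORECASE)
GREETING = re.compile(r"^(?:dear|hi|hello)\b[^,]*,$", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"^(?:(?:best|warm|kind)\s+)?(?:regards|sincerely|thanks)\b", re.IGNORECASE
)


def extract_bodies(text):
    """Return the subject line once, followed by the body lines of each mail."""
    kept = []
    keeping = False
    body_seen = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SUBJECT.match(line):
            # A subject line starts a new mail
            keeping, body_seen = True, False
            if line not in kept:
                kept.append(line)
        elif SIGN_OFF.match(line):
            # Everything after the sign-off is signature material
            keeping = False
        elif GREETING.match(line):
            # Assumption: a greeting after body text starts a signature
            if body_seen:
                keeping = False
        elif keeping:
            kept.append(line)
            body_seen = True
    return kept


sample = """Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
"""

for kept_line in extract_bodies(sample):
    print(kept_line)
```

Like the answer above, this will need adjusting for mails whose greetings or sign-offs fall outside the patterns.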

Related

Regex matches fine in tester but not in Python code

I'd like to remove text between the strings "Criteria Details" and either "\n{Some number}\n" or "\nPage {Some number}\n". My code is below:
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text)
print(test)
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
This works on regex101 for the string below, as I can see that the chunk between "Criteria Details" and "88" is detected, but the .search() in my code doesn't return anything, and nothing is replaced in .sub(). Am I missing something?
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Criteria Details
N/A
N/A
other
N/A
Exclusion
Criteria
Required
Medical
Information
Prescriber
Restrictions
Coverage
Duration
Other Criteria
Age Restrictions Patients aged less than 65 years, approve. Patients aged 65 years and older,
End of the Contract Year
PA does NOT apply to patients less than 65 yrs of age. High Risk
Medications will be approved if ALL of the following are met: a. Patient
has an FDA-approved diagnosis or CMS-approved compendia accepted
indication for the requested high risk medication AND b. the prescriber
has completed a risk assessment of the high risk medication for the patient
and has indicated that the benefits of the requested high risk medication
outweigh the risks for the patient AND c.Prescriber has documented that
s/he discussed risks and potential side effects of the medication with the
patient AND d. if patient is taking conconmitantly a muscle relaxant with
an opioid, the prescriber indicated that the benefits of the requested
combination therapy outweigh the risks for the patient.
Indications
All Medically-accepted Indications.
Off-Label Uses
N/A
88
Updated 06/2020
I would expect the output to be something like
cyclobenzaprine oral tablet 10 mg, 5 mg,
7.5 mg
PA Criteria
Updated 06/2020
You got it, just a silly mistake. Change your code to this
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE)
print(input_text)
Where you went wrong is
input_text = re.sub(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', ' ', input_text, flags=re.IGNORECASE) # This is the necessary replacement well done
test = re.search(r'Criteria Details[\w\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)', input_text) # This extracts a pattern which will never be found because you already removed it
print(test) # The result of the previous line which would never be found
Hope this helps! We all have bad days 😀
I figured it out. When using Pdfminer to parse the PDF into text, there aren't actually newlines after the page number, but they get converted into newlines if I copy and paste the output to the regex website, or Stackoverflow. I ended up using \s instead of \n to detect the trailing spaces after the page numbers.
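A minimal sketch of that last fix, using a made-up string in which the page number is followed by a space instead of a newline (the sample text and the names strict and relaxed are assumptions for illustration). Printing repr() of the parsed text is a quick way to see which whitespace characters are really there. Only the trailing \n is relaxed to \s here, so that numbers inside a sentence, such as " 65 ", do not end the match early.

```python
import re

# Hypothetical pdfminer-style text: "88" is followed by a space, not a newline
input_text = "PA Criteria\nCriteria Details\nN/A\nOff-Label Uses\nN/A\n88 Updated 06/2020"

# Pattern from the question ([\w\s\S] is equivalent to [\s\S])
strict = r'Criteria Details[\s\S]*?(\n[0-9]+\n|\nPAGE [0-9]+\n)'
# Same pattern, but any whitespace may follow the page number
relaxed = r'Criteria Details[\s\S]*?(\n[0-9]+\s|\nPAGE [0-9]+\s)'

print(repr(input_text))  # repr() makes the real whitespace visible
print(re.search(strict, input_text, flags=re.IGNORECASE))  # no match here

cleaned = re.sub(relaxed, ' ', input_text, flags=re.IGNORECASE)
print(repr(cleaned))
```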

How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling; however, I am unable to actually return a text tokenized by paragraph / topic change as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.
tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10,similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text) # same text returned
What I have are emails that follow this basic structure
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it would look like
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.
What about using splitlines? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)
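Building on splitlines, one way to group the stripped lines into the five sections is a small state machine. This is a rough sketch under strong assumptions, not something that will cope with every email: the function name split_sections, the header prefixes, the sign-off words and the "disclaimer" keyword are all made up for illustration, and the first non-header line is assumed to be the greeting.

```python
HEADER_PREFIXES = ("from:", "to:", "cc:", "date:", "subject:")  # assumed header names
SIGN_OFFS = ("regards", "best regards", "sincerely", "thanks")  # assumed closings


def split_sections(email_text):
    """Assign each non-empty line to one of the five sections (heuristic)."""
    sections = {"LOGISTICS": [], "INTRO": [], "BODY": [], "OUTRO": [], "DISCLAIMER": []}
    state = "LOGISTICS"
    for raw in email_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        lower = line.lower()
        if state == "LOGISTICS" and not lower.startswith(HEADER_PREFIXES):
            # Assumption: the first non-header line is the greeting
            sections["INTRO"].append(line)
            state = "BODY"
            continue
        if state == "BODY" and lower.startswith(SIGN_OFFS):
            state = "OUTRO"
        if state in ("BODY", "OUTRO") and "disclaimer" in lower:
            state = "DISCLAIMER"
        sections[state].append(line)
    return sections


email_text = """From: X
To: Y
Date: 10/03/2017
Hello team,
Some text here representing
the body
of the text.
Regards,
X
*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

sections = split_sections(email_text)
print("\n".join(sections["BODY"]))
```

Emails with different wording would need different keyword lists, which is the limitation the question already points out.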
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
References:
{1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June, 1994.
{2} Lee, J.Y. and Dernoncourt, F., 2016, June. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
{3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
{4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), pp. 33-64, March 1997.
{5} Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, 28 (1), March 2002, pp. 19-36.

python to extract list match from email thread

I'm new to Python. I need to retrieve lists of matches from a text.
For example, my text, shown below, is an email thread.
I need to extract every To, From, Sent, Subject and body from the mail thread.
The result for From needs to be a list:
From(1) = Crandall, Sean
From(2) = Nettelton, Marcus
To(1)= Crandall, Sean; Badeer, Robert
To(2)= Meredith, Kevin
and likewise for Sent, Subject, etc.
"-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."
You can use re.findall() for this, see: https://docs.python.org/2/library/re.html#re.findall. E.g.
re.findall("From: (.*) ", input_string)
would return a list of the From names (['Crandall, Sean', 'Meredith, Kevin']), assuming each name is always followed by the same amount of whitespace (here, a single trailing space).
If you want to get fancy, you could capture several fields in the same expression. E.g.
re.findall("From: (.*) \nSent: (.*)", input_string)
would return [('Crandall, Sean', 'Wednesday, May 23, 2001 2:56 PM'), ('Meredith, Kevin', 'Wednesday, May 23, 2001 11:16 AM')]
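If the amount of trailing whitespace is not known, a variant that anchors each header to its own line may be more forgiving. A small sketch, assuming the headers always start at the beginning of a line; the helper name header_values is made up for illustration:

```python
import re

input_string = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.

-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""


def header_values(name, text):
    """Return the value of every 'name:' header line, without surrounding blanks."""
    # re.MULTILINE lets ^ and $ match at the start and end of each line
    pattern = r"^{}:[ \t]*(.*?)[ \t]*$".format(re.escape(name))
    return re.findall(pattern, text, flags=re.MULTILINE)


senders = header_values("From", input_string)
recipients = header_values("To", input_string)
print(senders)     # ['Crandall, Sean', 'Meredith, Kevin']
print(recipients)  # ['Meredith, Kevin', 'Crandall, Sean; Badeer, Robert']
```

The same helper can be called with "Sent" and "Subject" to build the other lists.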
If you don't know regular expressions, and since your problem is not that hard, you may consider using the split() and replace() string methods.
Here are some lines of code that might be a good start:
mails = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""
mails_list = mails.split("-----Original Message-----\n")
mails_from = []
mails_sent = []
mails_to = []
mails_subject = []
mails_body = []
for mail in mails_list:
    # Skip the empty string that precedes the first separator
    if not mail:
        continue
    inter = mail.split("From: ")[1].split("\nSent: ")
    mails_from.append(inter[0])
    inter = inter[1].split("\nTo: ")
    mails_sent.append(inter[0])
    inter = inter[1].split("\nSubject: ")
    mails_to.append(inter[0])
    inter = inter[1].split("\n")
    mails_subject.append(inter[0])
    # Everything after the subject line is the body
    mails_body.append("\n".join(inter[1:]).strip())
Note that this only uses really basic concepts.
Here are some points that you might need to consider:
Try it yourself; you might need to make some adjustments.
With this method the parsing is quite strict: the format of the mails must match exactly.
There might be some extra whitespace that you want to remove, for example with the replace() method.

Python extract both names *AND* emails from body using regex in a single swoop

Python3
I need help creating a regex to extract names and emails from a forwarded email body, which will look similar to this always (real emails replaced by dummy emails):
> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa#aa-aaa.com>
> To: maria.brown#aaa.com, George Washington <george#washington.com>, =
thomas.jefferson#aaa.com, thomas.alva.edison#aaa.com, Juan =
<juan#aaa.com>, Alan <alan#aaa.com>, Alec <alec#aaa.com>, =
Alejandro <aaa#aaa.com>, Alex <aaa#planeas.com>, Andrea =
<andrea.mery#thomsen.cl>, Andrea <andrea.22#aaa.com>, Andres =
<andres#aaa.com>, Andres <avaldivieso#aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye
My first step was extracting all emails to a list with a custom function that I pass the whole email body to, like so:
import re

def extract_emails(block_of_text):
    t = r'\b[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)
A couple of days ago I asked a question about extracting names using regex to help me build the function to extract all the names. My idea was to join both later on. I accepted an answer that performed what I asked, and came up with this other function:
def extract_names(block_of_text):
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)
My problem now was matching the extracted names to the extracted emails, mainly because sometimes there are fewer names than emails. So I thought I had better try to build another regex to extract both names and emails.
This is my failed attempt to build such a regex.
[:,]([\w \<]+)([\w.-]+#[\w.-]+\.[\w.-]+)
Can anyone help and propose a nice, clean regex that grabs both name and email, to a list or dictionary of tuples? Thanks
EDIT:
The expected output of the regex in Python would be a list like this:
[('Charlie Brown', 'aaa#aaa.com'), ('', 'maria.brown#aaa.com'), ('George Washington', 'george#washington.com'), ('', 'thomas.jefferson#aaa.com'), ('', 'thomas.alva.edison#aaa.com'), ('Juan', 'juan#aaa.com'), ('Alan', 'alan#aaa.com'), ('Alec', 'alec#aaa.com'), ('Alejandro', 'aaa#aaa.com'), ('Alex', 'aaa#aaa.com'), ('Andrea', 'andrea.mery#thomsen.cl'), ('Andrea', 'andrea.22#aaa.com'), ('Andres', 'andres#aaa.com'), ('Andres', 'avaldivieso#aaa.com')]
Seems like you want something like this:
[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+#[\w.-]+)
>>> import re
>>> s = """ > Begin forwarded message:
>=20
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa#aa-aaa.com>
> To: maria.brown#aaa.com, George Washington <george#washington.com>, =
thomas.jefferson#aaa.com, thomas.alva.edison#aaa.com, Juan =
<juan#aaa.com>, Alan <alan#aaa.com>, Alec <alec#aaa.com>, =
Alejandro <aaa#aaa.com>, Alex <aaa#planeas.com>, Andrea =
<andrea.mery#thomsen.cl>, Andrea <andrea.22#aaa.com>, Andres =
<andres#aaa.com>, Andres <avaldivieso#aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> re.findall(r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+#[\w.-]+)', s)
[('Charlie Brown', 'aaa#aa-aaa.com'), ('', 'maria.brown#aaa.com'), ('George Washington', 'george#washington.com'), ('', 'thomas.jefferson#aaa.com'), ('', 'thomas.alva.edison#aaa.com'), ('Juan', 'juan#aaa.com'), ('Alan', 'alan#aaa.com'), ('Alec', 'alec#aaa.com'), ('Alejandro', 'aaa#aaa.com'), ('Alex', 'aaa#planeas.com'), ('Andrea', 'andrea.mery#thomsen.cl'), ('Andrea', 'andrea.22#aaa.com'), ('Andres', 'andres#aaa.com'), ('Andres', 'avaldivieso#aaa.com')]
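To make the pattern easier to follow, here is the same expression written out with re.VERBOSE and one comment per part. The comments are my reading of the pattern, not the answerer's own explanation, and the names compact, verbose and s are chosen just for this sketch. Note that '#' has to be escaped in verbose mode, because an unescaped '#' starts a comment.

```python
import re

compact = re.compile(
    r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+#[\w.-]+)'
)

verbose = re.compile(r"""
    [:,]                    # each entry follows a header colon or a list comma
    \s*=?\s*                # optional whitespace and an optional '=' line break
    (?:
        (                   # group 1: the name, one or two capitalized words
            [A-Z][a-z]+
            (?:\s[A-Z][a-z]+)?
        )
    )?                      # the whole name part is optional
    \s*=?\s*                # whitespace or an '=' line break after the name
    .*?                     # lazily skip anything else, e.g. the opening bracket
    ([\w.]+\#[\w.-]+)       # group 2: the address, with the separator escaped
""", re.VERBOSE)

s = """> From: Charlie Brown <aaa#aa-aaa.com>
> To: maria.brown#aaa.com, George Washington <george#washington.com>"""

pairs = verbose.findall(s)
print(pairs)
```

Entries without a name come back with an empty string in the first position, so every address keeps its place in the list.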

Extract some info from email using regular expression with Python

I need to parse an email file in emlx (the Mac OS X Mail file format) to extract some information using regular expressions with Python.
The relevant part of the email has the following format, and there is a lot of text before and after it.
...
Name and Address (multi line)
Delivery estimate: SOMEDATE
BOOKNAME
AUTHOR and PRICE
SELLER
...
The example is as follows.
...
Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States
Delivery estimate: February 3, 2011
1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80
Sold by: Textbooksrus LLC
...
How can I parse the email to extract them? I normally read a file line by line (lines = file.readlines(); for line in lines), but in this case some of the info is multi-line (the address, for example).
The thing is that this information is just one part of a big file, so I need to find a way to detect it.
I don't think that you need regular expressions. You could probably do this by using readlines() to load the file, then iterating over the lines looking for "Delivery estimate:" with the str.startswith() method. At that point, you have the line number where the data is located.
You can get the address by scanning backwards from that line number to find the block of text delimited by blank lines. Don't forget to use strip() when looking for blank lines.
Then do a forward scan from the delivery estimate line to pick up the other info.
This is likely to be faster than regular expressions, too.
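The scan described above could look roughly like this. It is a sketch that assumes the blocks are separated by blank lines and that the "Sold by:" line closes the section; the function name extract_order_block and the sample text are made up for illustration.

```python
def extract_order_block(lines):
    """Return (address_lines, estimate, detail_lines), or None if not found."""
    stripped = [line.strip() for line in lines]
    marker = "Delivery estimate:"
    for idx, line in enumerate(stripped):
        if line.startswith(marker):
            break
    else:
        return None
    estimate = stripped[idx][len(marker):].strip()

    # Scan backwards: skip the blank lines above the marker, then collect
    # the block of non-blank lines that sits above them
    i = idx - 1
    while i >= 0 and not stripped[i]:
        i -= 1
    address = []
    while i >= 0 and stripped[i]:
        address.append(stripped[i])
        i -= 1
    address.reverse()

    # Scan forwards: collect non-blank lines up to and including "Sold by:"
    details = []
    for line in stripped[idx + 1:]:
        if not line:
            continue
        details.append(line)
        if line.startswith("Sold by:"):
            break
    return address, estimate, details


sample_lines = """some text before

Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC

some text after
""".splitlines()

address, estimate, details = extract_order_block(sample_lines)
print(address)
print(estimate)
print(details)
```

With a real file, the list returned by readlines() can be passed in directly, since every line is stripped first.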
Do data = file.read(), which will give you the whole shebang, and then make sure to add line-start and line-end anchors to your regex where needed.
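For instance, with re.MULTILINE the ^ and $ anchors match at the start and end of every line, so single-line fields can be pulled out of the full text. A minimal sketch with a made-up data string standing in for the result of file.read():

```python
import re

# Hypothetical stand-in for data = file.read()
data = """United States

Delivery estimate: February 3, 2011

1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80

Sold by: Textbooksrus LLC
"""

# (.+) stops at the end of the line because '.' does not match a newline
estimate_match = re.search(r"^Delivery estimate:[ \t]*(.+)$", data, flags=re.MULTILINE)
seller_match = re.search(r"^Sold by:[ \t]*(.+)$", data, flags=re.MULTILINE)

estimate = estimate_match.group(1).strip()
seller = seller_match.group(1).strip()
print(estimate)  # February 3, 2011
print(seller)    # Textbooksrus LLC
```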
You could split on the double \n\n and work from there:
>>> s= """
... Engineer1
... 31500 N. Mopac Circle.
... Company, Building A, 3K.A01
... Dallas, TX 78759
... United States
...
... Delivery estimate: February 3, 2011
...
... 1 "Writing Compilers and Interpreters"
... Ronald Mak; Paperback; $21.80
...
... Sold by: Textbooksrus LLC
... """
>>> name, estimate, author_price, seller = s.strip().split("\n\n")
>>> print(name)
Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States
