Extract some info from email using regular expression with Python

I need to parse an email file in emlx (the Mac OS X email file format) to extract some information using regular expressions with Python.
The email contains the following format, and there is a lot of text before and after it.
...
Name and Address (multi line)
Delivery estimate: SOMEDATE
BOOKNAME
AUTHOR and PRICE
SELLER
...
The example is as follows.
...
Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States
Delivery estimate: February 3, 2011
1 "Writing Compilers and Interpreters"
Ronald Mak; Paperback; $21.80
Sold by: Textbooksrus LLC
...
How can I parse the email to extract them? I normally use line = file.readline() or for line in lines, but in this case some of the info is multi-line (the address, for example).
The thing is that this information is just one part of a big file, so I need a way to detect it.

I don't think that you need regular expressions. You could probably do this by using readlines() to load the file, then iterating over that looking for "Delivery estimate:" with the str.startswith() method. At that point, you have the line number where the data is located.
You can get the address by scanning backwards from the line number to find the block of text delimited by blank lines. Don't forget to use strip() when looking for blank lines.
Then do a forward scan from the delivery estimate line to pick up the other info.
Much faster than regular expressions too.
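A rough sketch of that approach (the file name and the exact return shape are illustrative assumptions, not part of the question):

def extract_order_info(path):
    # Load the whole message as a list of lines
    with open(path) as f:
        lines = [line.rstrip('\n') for line in f]
    for i, line in enumerate(lines):
        if line.startswith('Delivery estimate:'):
            # Backward scan: skip any blank lines just above the estimate,
            # then back up to the blank line (or start of file) that
            # delimits the address block
            end = i
            while end > 0 and not lines[end - 1].strip():
                end -= 1
            start = end
            while start > 0 and lines[start - 1].strip():
                start -= 1
            address = lines[start:end]
            # Forward scan: the next non-blank lines hold the book,
            # the author/price line and the seller
            rest = [l for l in lines[i + 1:] if l.strip()]
            book, author_price, seller = rest[:3]
            estimate = line[len('Delivery estimate:'):].strip()
            return address, estimate, book, author_price, seller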

Do data = file.read(), which will give you the whole shebang, and then make sure to anchor your regex with explicit line starts and ends (\n) where needed.
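For example, a minimal sketch (the file name is an assumption; the pattern follows the four-line layout shown in the question, with \n standing in for the line boundaries):

import re

with open('message.emlx') as f:
    data = f.read()   # the whole file as one string

m = re.search(r'Delivery estimate: (?P<date>.+)\n'
              r'(?P<book>.+)\n'
              r'(?P<author_price>.+)\n'
              r'Sold by: (?P<seller>.+)', data)
if m:
    print(m.group('date'), m.group('book'), m.group('seller'))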

You could split on the double newline ("\n\n") and work from there:
>>> s= """
... Engineer1
... 31500 N. Mopac Circle.
... Company, Building A, 3K.A01
... Dallas, TX 78759
... United States
...
... Delivery estimate: February 3, 2011
...
... 1 "Writing Compilers and Interpreters"
... Ronald Mak; Paperback; $21.80
...
... Sold by: Textbooksrus LLC
... """
>>> name, estimate, author_price, seller = s.split("\n\n")
>>> print name
Engineer1
31500 N. Mopac Circle.
Company, Building A, 3K.A01
Dallas, TX 78759
United States
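Since those details are only one part of a big file, you can locate the right chunk first. A sketch, assuming data holds the whole message and the "Delivery estimate:" marker appears exactly once:

blocks = data.split("\n\n")
# Index of the block holding the delivery estimate
idx = next(i for i, b in enumerate(blocks) if "Delivery estimate:" in b)
# The address block precedes it; the book/price and seller blocks follow
name, estimate, author_price, seller = blocks[idx - 1:idx + 3]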

Related

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+\$(?:\s+\w+\$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they are nearly all superficial. A bit of time with a cape cod cloth would take care of a lot of them. The pics show the imperfections at the "worst" possible angle to show the nature of the scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a bit hard to compare with others for sale, as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over the lines, and over each word within a line, checking whether it starts with '$', and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
That works IF there is only one occurrence, like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
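For example, applied to the price line from the question:

import re

text = 'Asking $2000 obo. PayPal shipped. CONUS.'
# findall returns every token containing a dollar amount; take the first
price = re.findall(r'\S*\$\S*\d', text)[0].replace('$', '')
print(price)   # 2000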

Regex to Remove Name, Address, Designation from an email text Python

I have a sample text of an email like this. I want to keep only the body of the text and remove names, addresses, designations, company names, and email addresses from it. So, to be clear, I only want the content of each mail between the Dear/Hi/Hello greeting and the Sincerely/Regards/Thanks sign-off. How can I do this efficiently, using a regex or some other way?
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director
roger@abc.com
78 Ford st.
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
john.doe@pqr.com
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Warm Regards,
Mr. Roger
Global Director
roger@abc.com
78 Ford st.
Center for Research
Office of New Discoveries
Food and Drug Administration
Loren@mno.com
From this text I only want as OUTPUT :
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Yes, an extension until June 22, 2018 is acceptable.
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
responding by June 15, 2018.check email for updates
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Below is an answer that works for your current input. The code will have to be adjusted when you process examples that fall outside the parameters outlined in the code below.
import re

with open('email_input.txt') as input:
    # List to store the cleaned lines
    clean_lines = []
    # Read until EOF
    lines = input.readlines()
    # Remove surrounding whitespace and newlines
    no_new_lines = [i.strip() for i in lines]
    # Convert the input to all lowercase
    lowercase_lines = [i.lower() for i in no_new_lines]
    # Boolean state variable to keep track of whether we want to be keeping lines or not
    lines_to_keep = False
    for line in lowercase_lines:
        # Look for lines that start with a subject line
        if line.startswith('subject: [external]'):
            # Set lines_to_keep True and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a sign-off
        elif line.startswith("regards,") or line.startswith("warm regards,") \
                or line.startswith("best regards,") or line.startswith("hello,"):
            # Set lines_to_keep False and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # Regex to catch greeting lines
            greeting_component = re.compile(r'(dear.*,|(hi.*,))', re.IGNORECASE)
            remove_greeting = re.match(greeting_component, line)
            if not remove_greeting:
                if line not in clean_lines:
                    clean_lines.append(line)

for item in clean_lines:
    print(item)
# output
subject: [external] re: query regarding supplement 73
yes, an extension until june 22, 2018 is acceptable.
we had initial discussion with the abc team us know if you would be able to
extend the response due date to june 22, 2018.
responding by june 15, 2018.check email for updates
please refer to your january 12, 2018 data containing labeling supplements
to add text regarding this symptom. we are currently reviewing your
supplements and have made additional edits to your label.
feel free to contact me with any questions.
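If you do want a single regex instead, here is a minimal sketch. The greeting and sign-off word lists are assumptions taken from the question, and an oddly ordered message like the third example (body before the greeting) would need extra handling:

import re

pattern = re.compile(
    r'(?:dear|hi|hello)\b[^\n]*\n'                     # greeting line
    r'(?P<body>.*?)\n'                                 # message body (lazy)
    r'(?:warm |best )?(?:regards|sincerely|thanks)',   # sign-off line
    re.DOTALL | re.IGNORECASE)

with open('email_input.txt') as f:
    for body in pattern.findall(f.read()):
        print(body.strip())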

BeautifulSoup: when pulling text from a section, <emph> and other tags are ignored causing adjacent words to be pushed together

I have an XML document. I want to pull all the text between all <p> .. </p> tags. Below is an example of the text. The problem is that in a sentence like:
"Because the <emph>raspberry</emph> and.."
the output is "Because theraspberryand...". Somehow, the emph tags are being dropped (which is good) but being dropped in a way that pushes together the adjacent words.
Here is the relevant code I am using:
xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
text = text + " " + para.text + " "
Here is the start of part of the text, in case the full text helps:
<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs@andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &lsqbVernon, D. M. &
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>
I think the problem here is that you're trying to write bs4 code with bs3.
The obvious fix is to use bs4 instead.
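In bs4, for example, get_text() accepts a separator argument, which keeps adjacent text nodes apart. A minimal sketch (xml_text stands for your document; the 'xml' parser requires lxml, and 'html.parser' also works for a quick test):

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_text, 'xml')
# One space between text nodes, so <emph> boundaries don't glue words together
text = ' '.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))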
But in bs3, the docs show two ways to get all of the text recursively from all contents of a soup:
''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))
You can obviously change either one of those to explicitly strip whitespace off the edges and add exactly one space between each node instead of relying on whatever space might be there:
' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
' '.join(s.strip() for s in soup.findAll(text=True))
I wouldn't want to guarantee that this will be exactly the same as the bs4 text property… but I think it's what you want here.

Python multiline regex search between sections

I am trying to sort data coming from an online plain text government report that looks something like this:
Potato Prices as of 24-SEP-2014
Idaho
BrownSpuds
SomeSpuds 1.90-3.00 mostly 2.00-2.50
MoreSpuds 2.50-3.50
LotofSpuds 5.00-6.50
Washington
RedSpuds
TinyReds 1.50-2.00
BigReds 2.00-3.50
BrownSpuds
SomeSpuds 1.50-2.50
MoreSpuds 3.00-3.50
LotofSpuds 5.50-6.50
BulkSpuds 1.00-2.50
Long Island
SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50
etc...
I included the inconsistent indents and line breaks intentionally. This is a government operation.
But I need a function that can look up the price for "MoreSpuds" in Idaho, for example, or "TinyReds" in Washington. I have an inkling that this is a job for Regex, but I can't figure out how to search multiple lines between "Idaho" and "Washington".
EDIT: Adding the following difficulty. A particular item isn't always present in a given state. For example, "RedSpuds" in Washington might go out of season before "RedSpuds" in another state. I need the search to end before it reaches the next state, giving me no price at all, if the item isn't listed.
I also just ran into a case where the prices were written in a paragraph instead of a list. Sort of like the last example, but the actual product names are a lot longer, such as "One baled 10 5-lb sacks sz A 10.00-10.50" so some of the names get split between lines, meaning there might be a newline anywhere in the middle of the name.
Use the DOTALL modifier (?s) to make the dot match newline characters as well.
>>> import re
>>> s = """Potato Prices as of 24-SEP-2014
... Idaho
... BrownSpuds
... SomeSpuds 1.90-3.00 mostly 2.00-2.50
... MoreSpuds 2.50-3.50
... LotofSpuds 5.00-6.50
...
... Washington
...
... RedSpuds
... TinyReds 1.50-2.00
... BigReds 2.00-3.50
... BrownSpuds
... SomeSpuds 1.50-2.50
... MoreSpuds 3.00-3.50
... LotofSpuds 5.50-6.50
... BulkSpuds 1.00-2.50
...
... Long Island
... SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50"""
To get the price of MoreSpuds in Idaho,
>>> m = re.search(r'(?s)\bIdaho\n*(?:(?!\n\n).)*?MoreSpuds\s+(\S+)', s)
>>> m.group(1)
'2.50-3.50'
To get the price of TinyReds in Washington,
>>> m = re.search(r'(?s)\bWashington\n*(?:(?!\n\n).)*?TinyReds\s+(\S+)', s)
>>> m.group(1)
'1.50-2.00'
Pattern Explanation:
(?s) DOTALL modifier.
\b Word boundary which matches between a word and non-word character.
Washington The section (state) name.
\n* Matches zero or more new line characters.
(?:(?!\n\n).)*? A non-capturing group in which the negative lookahead lets the dot match any character except the start of a \n\n (a blank line), so the match cannot cross into the next state's section. The ? after the * forces the regex engine to find the shortest possible match.
TinyReds Product name.
\s+ Matches one or more space characters.
(\S+) Following one or more non-space characters are captured into group 1.
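To get the lookup function the question asks for, the same pattern can be parameterized. A sketch (re.escape guards names containing spaces or dots, such as Long Island; returning None covers the out-of-season case):

import re

def get_price(text, state, product):
    # The tempered dot (?:(?!\n\n).)*? refuses to cross a blank line, so a
    # product missing from that state's section yields None rather than a
    # price from a later state
    pattern = (r'(?s)\b' + re.escape(state) + r'\n*'
               r'(?:(?!\n\n).)*?' + re.escape(product) + r'\s+(\S+)')
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(get_price(s, 'Idaho', 'MoreSpuds'))      # 2.50-3.50
print(get_price(s, 'Washington', 'TinyReds'))  # 1.50-2.00
print(get_price(s, 'Idaho', 'TinyReds'))       # None (not listed in Idaho)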

How can I organize each scraped item into a csv row?

What is the best way to organize scraped data into a csv? More specifically each item is in this form
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.
There's really no shortcut on this. Line 1 is easy. Just assign it to url. Line 3 can probably be split on , without any ill effects, but line 2 will have to be manually parsed. What do you know about word1-wordN? Are you sure "practice" will never be one of the words? Are you sure the words are only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits so you're left with a list of words, and split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no). If there is a variable number of words, you don't want to inline them in your tuple like that, because you won't know how to get them back out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
More specifically? You're on your own there.
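That said, here is a sketch of one possible round trip for your example (the ' - ' and ' practice ' delimiters, the three-part name, and the output file name are all assumptions read off your sample):

import csv

record = ['http://www.examplefirm.com/jang',
          'Joe E. Ang - partner - privatization mergers, '
          'media & technology practice New York.',
          'JD, University of Chicago Law School, 1985']

url = record[0]
name, rank, rest = record[1].split(' - ')
first, middle, last = name.split()                    # 'Joe', 'E.', 'Ang'
words, city = rest.rstrip('.').split(' practice ')
degree, school, year = [p.strip() for p in record[2].split(',')]

with open('firms.csv', 'a', newline='') as f:
    # A csv row is flat, so keep the practice areas as one cell
    csv.writer(f).writerow([url, first, middle, last, rank,
                            words, city, school, year])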
