How to parse email body in Python? [duplicate]

How to parse email body in Python? [duplicate] - python

I want to retrieve body (only text) of emails using python imap and email package.
As per this SO thread, I'm using the following code:
mail = email.message_from_string(email_body)
bodytext = mail.get_payload()[ 0 ].get_payload()
Though it's working fine for some instances, but sometime I get similar to following response
[<email.message.Message instance at 0x0206DCD8>, <email.message.Message instance at 0x0206D508>]

You are assuming that messages have a uniform structure, with one well-defined "main part". That is not the case; there can be messages with a single part which is not a text part (just an "attachment" of a binary file, and nothing else) or it can be a multipart with multiple textual parts (or, again, none at all) and even if there is only one, it need not be the first part. Furthermore, there are nested multiparts (one or more parts is another MIME message, recursively).
In so many words, you must inspect the MIME structure, then decide which part(s) are relevant for your application. If you only receive messages from a fairly static, small set of clients, you may be able to cut some corners (at least until the next upgrade of Microsoft Plague hits) but in general, there simply isn't a hierarchy of any kind, just a collection of (not necessarily always directly related) equally important parts.

The main problem in my case is that replied or forwarded message shown as message instance in the bodytext.
Solved my problem using the following code:
bodytext=mail.get_payload()[0].get_payload();
if type(bodytext) is list:
bodytext=','.join(str(v) for v in bodytext)

My external lib: https://github.com/ikvk/imap_tools
from imap_tools import MailBox
# get list of email bodies from INBOX folder
with MailBox('imap.mail.com').login('test#mail.com', 'pwd', 'INBOX') as mailbox:
bodies = [msg.text or msg.html for msg in mailbox.fetch()]

Maybe this post (of mine) can be of help. I receive a Newsletter with prices of different kind of oil in the US. I fetch email in gmail with a given pattern for the title, then I extract the prices in the mail body using regex. So i have to access the mail body for the last n emails which title observe given pattern.
I am using email.message_from_string() also: msg = email.message_from_string(response_part[1])
so maybe it gives you concrete example of how to use methods in this python lib.

Related

Trigger an email when some a certain file becomes available in some location

newbie here.
I have been learning Python for 3 months now because my team uses it to handle big data. I am almost done fully automating the end-to-end process of producing our monthly financials, except for one crucial step:
I need to trigger an email when the extract becomes available.
Basically, I want to do this:
if file F exists in folder Downstream:
Then send email to self.
I will then confirm running the entire process by replying some code word to this email.
TIA!

You may use Python's email library but usually, the examples are not really straightforward and simple especially if you are new. For this reason, I actually made a package that most likely is enough for your needs regarding sending emails.
Just pip install it:
pip install redmail
Then to solve your problem:
from pathlib import Path
from redmail import EmailSender
email = EmailSender(
host="<YOUR SMTP SERVER>",
port=1,
user_name="you#example.com", # If not needed, don't pass this
password="<YOUR PASSWORD>" # If not needed, don't pass this
)
# Check if file exists
if Path("path/to/file.csv").is_file():
# File does exists, send the email
email.send(
receivers=["you#example.com"],
subject="File is now found",
)
If you want to include a message, pass text and/or html parameters and write the email message as a string. You may also set the receivers as a default (attribute of the email instance) if repeatedly send you an email in different sections of your code.
Red Mail is well tested and well documented open-source library with features such as including attachments, embedding images, templating text or HTML and preformatted errors.
Documentation: https://red-mail.readthedocs.io/en/latest/
Source code: https://github.com/Miksus/red-mail
Note that pathlib is from standard library. It is a handy library for manipulating file paths. You could also use os.path.exists if you prefer that.

Parse email and send reply using googleapiclient in Python

I'm currently working on a project and I have chosen to use Gmail for sending and receiving emails. I want to be able to send an email, have a user reply to it, and parse their response. The response can be any number of lines (so something like response.split('\n')[0] won't work). It should then be able to reply directly to that email thread.
I've been following the googleapiclient tutorials, but they leave a lot to be desired. However, I've managed to read email threads using:
service.users.threads().get(userId='me', id=thread_id).execute()
where thread_id is (predictably) the ID of the email thread (which I find elsewhere). In the large dict returned by this, there is a section of base64 data which contains the content of the email. This was the only place I could find the actual data for the response. Unfortunately, I get this when it is decoded:
b'This is my response from my phone\r\n\r\nOn Sat, 28 Nov 2020, 8:40 PM , <myemail#gmail.com>\r\nwrote:\r\n\r\n> This is sent from the python script\r\n>\r\n'
This is all the data in the thread, however, I only want the response as there is clearly no way to split this to get only the data I need. The best I can think of is to parse out anything of the form On <date>, <time>, but that could lead to problems. There must be another way to extract only This is my response from my phone and no other data.
Once I get the response, I want to parse it and reply with an appropriate response based on the contents of the message. I would prefer to reply directly to the thread, rather than starting a new one. Unfortunately, all the Google documentation says is:
If you're trying to send a reply and want the email to thread, make sure that:
The Subject headers match
The References and In-Reply-To headers follow the RFC 2822 standard.
The documentation provides this code (with some minor modifications by me) for sending an email:
def create_message(sender, to, subject, message_text):
message = MIMEText(message_text)
message['to'] = to
message['from'] = sender
message['subject'] = subject
return {'raw': base64.urlsafe_b64encode(message.as_bytes()).decode()}
Sending a reply with the same subject line is pretty straight forward (message['subject'] = same_subject_as_before), but I don't even know where to start with the References and In-Reply-To headers. How do I set these?

Why is this hard?
You are trying to use e-mail for something it simply wasn't originally designed for. My impression is you want the e-mail response to contain structured data, but e-mail text lacks any well-defined structure. It also depends on which e-mail client the other user has, and whether they send HTML e-mail or not.
This is usually easy for a human to see, but difficult for a computer. Which suggests that Machine Learning might be the best strategy if you want higher reliability. Whatever solution you choose, it's not going to be 100% reliable.
E-mail can be plain text or HTML, or both.
There is no well-defined structure to separate replies from the original text. Wikipedia lists a few different "posting styles".
In the old days when "Netiquette" was still cool, putting your reply on top ("top-posting") was considered bad practice, and new Internet users were told by old folks to avoid top-posting. Some users still reply below or interleaved with the original text.
The reply line (e.g. "On DATE, EMAIL wrote:" or "-------- Original Message --------") will be different, depending on which e-mail client is used, what language that client is set to, and the user's own preferences.
Using a text delimiter
A class of software which faces a similar problem as the one you describe is customer service applications, which allow operators to use e-mail for communication. A common strategy is to inject some unique text in your templates for outgoing e-mail. For example, Zendesk uses a text "delimiter" such as:
##- Please type your reply above this line -##
This serves two purposes; it tells users to top-post, and it provides a separator to cut out most of the irrelevant text.
If you first handle any HTML encoding, you should be able to split the message by such a text delimiter. It's not perfect, but it usually works.
Use products made by others
There are some open source options, such as:
https://github.com/zapier/email-reply-parser
And I found a commercial product, SigParser, which seems to use a machine learning model that they've trained very carefully:
https://sigparser.com/developers/extract-reply-chains-from-emails/
They also explain some of the challenges of parsing e-mail text into structured data.

Is it possible to sort through all emails in Outlook inbox from a specific sender using python?

I need to sort through my corporate outlook account and want to sort through all of my emails that were sent by a certain address and locate all replies to that email. My understanding is that I can use the win32com.client module to access my outlook and am able to read all of the "unread" emails in the folder. However, I want to change the filter to read emails according to the specific sender. I can't seem to find a comprehensive list of methods than can be called on my messages object. Can you specify a sender?
Here is my code so far:
import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6) # "6" refers to the index of a folder - in this case,
# the inbox. You can change that number to reference
# any other folder
messages = inbox.Items
messages.Sort("[ReceivedTime]",True)
sender = 'xxxxx#xxxx.com'
for message in messages:
if sender in message:
print (message.body)

Iterating over all items in a folder is not really a good idea. Instead, you need to use the Find/FindNext or Restrict methods of the Items class in Outlook. Read more about them in the following articles:
How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder
For example, you can use the following search criteria:
outItems = Items.Restrict("[SenderEmailAddress] = " & "'" & address & "'")
Also, you may find the AdvancedSearch method of the Application class helpful. The key benefits of using the `AdvancedSearch method in Outlook are:
The search is performed in another thread. You don’t need to run another thread manually since the AdvancedSearch method runs it automatically in the background.
Possibility to search for any item types: mail, appointment, calendar, notes etc. in any location, i.e. beyond the scope of a certain folder. The Restrict and Find/FindNext methods can be applied to a particular Items collection (see the Items property of the Folder class in Outlook).
Full support for DASL queries (custom properties can be used for searching too). You can read more about this in the Filtering article in MSDN. To improve the search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
You can stop the search process at any moment using the Stop method of the Search class.

Filter EWS mailbox by recipient address using exchangelib

I'm writing a monitoring solution using python3 with exchangelib and trying to count messages in our team's mailbox. One of the criteria: recipient list must contain specific email address.
When i use filter() with author or subject arguments script is working fine and return correct results.
But when i tried to filter by to_recipients or to_recipients__contains (which is list-type field), script throws an exception:
ValueError: EWS does not support filtering on field 'to_recipients'
Is there a way to filter mailbox by recipient email_address, avoiding to fetch all messages and than filtering it on the client side?

[exchangelib maintainer here]
I don't think there is. You could try to flip the is_searchable flag on that field and search anyway, but I never could get filtering to work in my tests. I can't remember if it throws server errors, returns all items anyway, or returns an empty list.
I'm happy to accept patches it you do find a solution.

Email title and link from rss-feed and email them

I'm doing a bit of an experiment in Python. I'm making a script which checks a rss-feed for new items, and then sends the title and link of the items via email. I've got the script to work to a certain level: when it runs it will take the link+title of the newest item and email it, regardless of wether it emailed that file already or not. I'd need to add 2 things: a way to get multiple items at once (and email those, one by one), and a way to check wether they have been sent already. How would I do this? I'm using feedparser, this is what I've got so far:
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
link = d.entries[0].link
title = d.entries[0].title
And then a couple of lines which send an email with "link" and "title" in there. I know I'd need to use the Etag, but haven't been able to work out how, and how would I send the emails 1 by 1?

for the feed parsing part, you could consider following the advise given in this question regarding How to detect changed and new items in an RSS feed?. Basically, you could hash the contents of each entry and use that as an id.
For instance, on the first run of your program it will calculate the hash of each entry, store that hash, and send these new entries by mail. On it's next run, it will rehash each entry's content and compare those hashes with the ones found before (you should use some sort of database for this, or at least an on memory dictionary/list when developing with the entries already parsed and sent). If your program finds hashes that where not generated on the previous runs, it will assemble a new email and send it with the "new" entries.
As for your email assembling part, the question Sending HTML email in Python could help. Just make sure to send a text only and a html version.

For the simplest method see the python smtplib documentation example. (I won't repeat the code here.) It's all you need for basic email sending.
For nicer/more complicated email content also look into python's email module, of course.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.