Parse email and send reply using googleapiclient in Python


I'm currently working on a project and I have chosen to use Gmail for sending and receiving emails. I want to be able to send an email, have a user reply to it, and parse their response. The response can be any number of lines (so something like response.split('\n')[0] won't work). It should then be able to reply directly to that email thread.
I've been following the googleapiclient tutorials, but they leave a lot to be desired. However, I've managed to read email threads using:
service.users().threads().get(userId='me', id=thread_id).execute()
where thread_id is (predictably) the ID of the email thread (which I find elsewhere). In the large dict returned by this, there is a section of base64 data which contains the content of the email. This was the only place I could find the actual data for the response. Unfortunately, I get this when it is decoded:
b'This is my response from my phone\r\n\r\nOn Sat, 28 Nov 2020, 8:40 PM , <myemail@gmail.com>\r\nwrote:\r\n\r\n> This is sent from the python script\r\n>\r\n'
This is all the data in the thread. However, I only want the response, and there is no obvious way to split this to get only the data I need. The best I can think of is to parse out anything of the form On <date>, <time>, but that could lead to problems. There must be another way to extract only This is my response from my phone and no other data.
Once I get the response, I want to parse it and reply with an appropriate response based on the contents of the message. I would prefer to reply directly to the thread, rather than starting a new one. Unfortunately, all the Google documentation says is:
If you're trying to send a reply and want the email to thread, make sure that:
The Subject headers match
The References and In-Reply-To headers follow the RFC 2822 standard.
The documentation provides this code (with some minor modifications by me) for sending an email:
import base64
from email.mime.text import MIMEText

def create_message(sender, to, subject, message_text):
    # Build a plain-text message and encode it the way the Gmail API expects
    message = MIMEText(message_text)
    message['to'] = to
    message['from'] = sender
    message['subject'] = subject
    return {'raw': base64.urlsafe_b64encode(message.as_bytes()).decode()}
Sending a reply with the same subject line is pretty straightforward (message['subject'] = same_subject_as_before), but I don't even know where to start with the References and In-Reply-To headers. How do I set these?

Why is this hard?
You are trying to use e-mail for something it simply wasn't originally designed for. My impression is you want the e-mail response to contain structured data, but e-mail text lacks any well-defined structure. It also depends on which e-mail client the other user has, and whether they send HTML e-mail or not.
Telling a reply apart from the quoted text is usually easy for a human, but difficult for a computer, which suggests that machine learning might be the best strategy if you want higher reliability. Whatever solution you choose, it's not going to be 100% reliable.
E-mail can be plain text or HTML, or both.
There is no well-defined structure to separate replies from the original text. Wikipedia lists a few different "posting styles".
In the old days when "Netiquette" was still cool, putting your reply on top ("top-posting") was considered bad practice, and new Internet users were told by old folks to avoid top-posting. Some users still reply below or interleaved with the original text.
The reply line (e.g. "On DATE, EMAIL wrote:" or "-------- Original Message --------") will be different, depending on which e-mail client is used, what language that client is set to, and the user's own preferences.
Using a text delimiter
A class of software which faces a similar problem as the one you describe is customer service applications, which allow operators to use e-mail for communication. A common strategy is to inject some unique text in your templates for outgoing e-mail. For example, Zendesk uses a text "delimiter" such as:
##- Please type your reply above this line -##
This serves two purposes: it tells users to top-post, and it provides a separator to cut out most of the irrelevant text.
If you first handle any HTML encoding, you should be able to split the message by such a text delimiter. It's not perfect, but it usually works.
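A minimal sketch of the delimiter approach, assuming the body has already been reduced to plain text and that your outgoing template contained the marker. The function name `extract_reply` and the constant `DELIMITER` are made up for illustration.

```python
import html

# Assumed marker: the same text your outgoing template injects.
DELIMITER = "##- Please type your reply above this line -##"


def extract_reply(body, delimiter=DELIMITER):
    """Return the text above the delimiter, minus trailing quote residue."""
    # Decode HTML entities such as &amp; (assumes tags were already stripped).
    text = html.unescape(body)
    # Keep only what precedes the first occurrence of the delimiter.
    head = text.split(delimiter, 1)[0]
    lines = head.splitlines()
    # Drop trailing blank lines and bare quote markers ("> ").
    while lines and (not lines[-1].strip() or lines[-1].lstrip().startswith(">")):
        lines.pop()
    return "\n".join(lines).strip()
```

Note that an attribution line such as "On DATE, EMAIL wrote:" can still be left at the end of the result, since its wording depends on the client; that part needs separate, client-specific handling.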
Use products made by others
There are some open source options, such as:
https://github.com/zapier/email-reply-parser
And I found a commercial product, SigParser, which seems to use a machine learning model that they've trained very carefully:
https://sigparser.com/developers/extract-reply-chains-from-emails/
They also explain some of the challenges of parsing e-mail text into structured data.
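For the threading half of the question, here is a minimal sketch of a reply builder. It assumes `original_message_id` is the value of the `Message-ID` header of the message being answered (found among that message's headers) and `thread_id` is the Gmail thread ID. The function name `create_reply` and its parameters are hypothetical, not part of any Google sample.

```python
import base64
from email.message import EmailMessage


def create_reply(sender, to, subject, message_text,
                 original_message_id, thread_id, references=""):
    """Build a Gmail API send body for a reply that stays in the same thread."""
    message = EmailMessage()
    message.set_content(message_text)
    message["To"] = to
    message["From"] = sender
    # Keep the subject matching the thread, adding "Re: " once.
    if not subject.lower().startswith("re:"):
        subject = "Re: " + subject
    message["Subject"] = subject
    # In-Reply-To names the message being answered; References lists the
    # earlier Message-IDs of the conversation followed by that same ID.
    message["In-Reply-To"] = original_message_id
    message["References"] = (references + " " + original_message_id).strip()
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode()
    # threadId tells Gmail which existing thread the message belongs to.
    return {"raw": raw, "threadId": thread_id}
```

The returned dict would then be passed as `body` to `service.users().messages().send(userId='me', body=...).execute()`.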

Related

Find emails that haven't been replied with Python & Exchangelib

Currently I am using Python and the Exchangelib module to build a macro that shows emails that haven't been replied to.
Background for the macro:
A support group of 3 people gets a lot of emails daily from customers at "support@abc.com".
One of the 3 people replies using the same address, "support@abc.com", as the sender.
Because of the high volume of daily mail and the fact that all 3 people share "support@abc.com" for replies, human error happens from time to time and some emails stay unreplied.
What I would like to try is to use the following symbol as the sign if the email is replied.
I could not figure out what the attribute for that is called.
I have compared all attributes for the second and the third emails side by side. I was expecting that the second email has a certain boolean attribute X with value "True" while the third email "False" (or vice versa):
Does such a boolean attribute exist? If no, how could my web browser show the symbol on my first screenshot?
If it does not exist, how would you solve it?
An alternative would require every reply from "support@abc.com" to be sent not only to the customer but also to "support@abc.com" itself, as CC or as a normal recipient.
After that I just need to read the attribute "conversation_id" and compare it to other earlier emails.
I don't like the alternative because of the CC, it would create a new element in "the solution" that is prone to human error.
Any inputs would be welcome.
Thank you in advance.

I don't know of any field on the Message object in EWS that tells you directly whether that message has a reply.
I think your best bet is to use the conversation_id of the message and check your Sent folder for that conversation_id. I believe that's what OWA does - messages where only one message is known with that conversation ID will not get the "replied" icon.
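A rough sketch of that comparison, written against plain objects rather than any particular library: it only assumes each item exposes a `conversation_id` attribute, and that the ID is either hashable itself or carries its value in an `id` attribute. The function name `find_unreplied` is made up for illustration.

```python
def find_unreplied(inbox_items, sent_items):
    """Return inbox items whose conversation has no message in Sent."""

    def key(item):
        cid = item.conversation_id
        # Assumption: the ID is either a plain value or an object with `.id`.
        return getattr(cid, "id", cid)

    # Collect every conversation that already has a sent message.
    replied = {key(item) for item in sent_items}
    return [item for item in inbox_items if key(item) not in replied]
```

With exchangelib, the two arguments would presumably be the items of the inbox and sent folders (for example `account.inbox.all()` and `account.sent.all()`); restricting the query to recent mail would keep this fast.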

How to parse email body in Python? [duplicate]

I want to retrieve body (only text) of emails using python imap and email package.
As per this SO thread, I'm using the following code:
mail = email.message_from_string(email_body)
bodytext = mail.get_payload()[ 0 ].get_payload()
It works fine in some cases, but sometimes I get a response similar to the following:
[<email.message.Message instance at 0x0206DCD8>, <email.message.Message instance at 0x0206D508>]

You are assuming that messages have a uniform structure, with one well-defined "main part". That is not the case; there can be messages with a single part which is not a text part (just an "attachment" of a binary file, and nothing else) or it can be a multipart with multiple textual parts (or, again, none at all) and even if there is only one, it need not be the first part. Furthermore, there are nested multiparts (one or more parts is another MIME message, recursively).
In so many words, you must inspect the MIME structure, then decide which part(s) are relevant for your application. If you only receive messages from a fairly static, small set of clients, you may be able to cut some corners (at least until the next upgrade of Microsoft Plague hits) but in general, there simply isn't a hierarchy of any kind, just a collection of (not necessarily always directly related) equally important parts.
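As one way of inspecting the structure, the sketch below walks every part of a message and collects the `text/plain` ones that are not attachments. Which parts are "relevant" is an application decision; the function name `extract_text_parts` and this particular selection rule are only an illustration.

```python
import email
from email import policy


def extract_text_parts(raw_message):
    """Return the decoded text of every non-attachment text/plain part."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    texts = []
    # walk() visits the message and all nested parts, depth first.
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers have no body of their own
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        # With policy.default, get_content() decodes the transfer
        # encoding and charset and returns a str.
        texts.append(part.get_content())
    return texts
```

A message with no textual part returns an empty list, and a message with several returns all of them, so the caller still has to decide what to do in those cases.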

The main problem in my case is that a replied or forwarded message shows up as a message instance in the bodytext.
Solved my problem using the following code:
bodytext = mail.get_payload()[0].get_payload()
# A nested multipart payload is a list of Message objects; join them as text
if isinstance(bodytext, list):
    bodytext = ','.join(str(v) for v in bodytext)

My external lib: https://github.com/ikvk/imap_tools
from imap_tools import MailBox
# get list of email bodies from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
    bodies = [msg.text or msg.html for msg in mailbox.fetch()]

Maybe this post (of mine) can be of help. I receive a newsletter with prices of different kinds of oil in the US. I fetch email from Gmail whose title matches a given pattern, then I extract the prices from the mail body using regex. So I have to access the mail body of the last n emails whose titles match the pattern.
I am also using email.message_from_string(): msg = email.message_from_string(response_part[1])
so maybe it gives you a concrete example of how to use the methods in this Python library.

Information Extraction from Text into Structured Data with Python

I'm nearly a total outsider to programming, just interested in it.
I work in a Shipbrokering company and need to match between positions (which ship will be open at where, when) and orders (what kind of ships will be needed at where, when for what kind of employment).
And we send and receive such info (positions and orders) by emails to and from our principals and co-brokers.
There are thousands of such emails each day.
We do the matching by reading the emails manually.
I want to build an app to do the matching for us.
One important part of this app will do the information extraction from email text.
==> My question is how do I use Python to extract unstructured info into structured data.
Sample email of an order [annotation in the brackets, but is not included in the email]:
Email Subject: 20k dwt requirement, 20-30/mar, Santos-Conti
Content:
Acct ABC [Account Name]
Abt 20,000 MT Deadweight [Size of Ship Needed]
Delivery to make Santos [Delivery Point/Range, Owners will deliver the ship to Charterers here]
Laycan 20-30/Mar [Laycan (the time spread in which delivery can be accepted)]
1 time charter with grains [What kind of Employment/Trade, Cargo]
Duration about 35 days [Duration]
Redelivery 1 safe port Continent [Redelivery Point/Range, Charterers will redeliver the ship back to Owners here.]
Broker name/email/phone...
End Email
The same email can be written in many different ways - some write it in one line, some use l/c instead of laycan...
And there are emails for positions with ship's name, open port, date range, ship's deadweight and other specs.
How can I extract the info and put it into structured data, with Python?
Let's say I have put all email contents into text files.
Thanks.

Below is a possible approach:
Step 1: Classify the mails in categories using the subject and/or message in the mail.
As you stated one category is of mails requesting position and the other is of mails of order.
Machine learning can be used to classify. You can use a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python. Here is the link on text classification using NLTK.
Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread, etc.). As you mentioned, the challenge here is that there is no fixed format for these data. To solve this problem, you might consider preparing an exhaustive list of synonyms for each label (for account, the list could be ['acct', 'a/c', 'account', 'acnt']). This should be done once, by going through a fixed volume of previous mails.
To make the solution more effective, you could consider implementing an option for active learning (i.e., prompt the user if a mail contains a label that is not found in any list. E.g. if "accnt" is used in a mail, it won't be resolved, so the user should be asked which category it falls in.)
Once a label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format.
You can refer to this discussion for a better understanding.
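Step 2 can be sketched roughly as follows, assuming one field per line as in the sample order. Only the `account` synonyms and `l/c` come from the discussion above; the other labels, and the names `LABEL_SYNONYMS`, `match_label` and `extract_fields`, are illustrative assumptions.

```python
import re

# Synonym lists per label; extend these from a sample of real mails.
LABEL_SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c"],
    "delivery": ["delivery"],
    "redelivery": ["redelivery"],
    "duration": ["duration"],
}


def match_label(line):
    """Return (label, value) if the line starts with a known synonym."""
    for label, synonyms in LABEL_SYNONYMS.items():
        for synonym in synonyms:
            pattern = re.escape(synonym) + r"\b[\s:]*(.*)"
            found = re.match(pattern, line, re.IGNORECASE)
            if found:
                return label, found.group(1).strip()
    return None


def extract_fields(text):
    """Split a mail body into recognised fields and unresolved lines."""
    fields, unknown = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        result = match_label(line)
        if result:
            fields[result[0]] = result[1]
        else:
            # Candidates for the active-learning prompt described above.
            unknown.append(line)
    return fields, unknown
```

Lines that end up in `unknown` are the ones a user would be asked about, and their answers would be added to the synonym lists.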

How hard is it to build an Email client? - Python

I'm venturing in unknown territory here...
I am trying to work out how hard it could be to implement an Email client using Python:
Email retrieval
Email sending
Email formatting
Email rendering
Also I'm wondering if all protocols are easy/hard to support e.g. SMTP, IMAP, POP3, ...
Hopefully someone could point me in the right direction :)

The Python language does offer raw support for the needed protocols in its standard library. Properly using them, and properly parsing and assembling a "modern day" e-mail message, however, can be tough to do.
Also, you didn't say if you want to create a graphical interface for your e-mail client -- if you want to have a proper graphical interface -- up to the point of being usable, it is quite a lot of work.
Local e-mail storage would be the easier part - unless you want to properly implement the mbox file format (RFC 4155) so that other software can easily read/write the messages you have fetched, you can store them as Python objects using an ORM or an object-oriented database, such as ZODB, or MongoDB.
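If mbox compatibility is wanted, the standard library's `mailbox` module already reads and writes that format. A minimal sketch follows; the helper names `store_message` and `list_subjects` are made up, and file locking is left out for brevity.

```python
import mailbox


def store_message(mbox_path, msg):
    """Append one email.message.Message to an mbox file, creating it if needed."""
    box = mailbox.mbox(mbox_path)
    try:
        box.add(msg)
    finally:
        # close() flushes pending changes to disk.
        box.close()


def list_subjects(mbox_path):
    """Return the Subject header of every message stored in the mbox file."""
    box = mailbox.mbox(mbox_path)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()
```

A real client that shares the file with other programs would also call `box.lock()` and `box.unlock()` around modifications.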
If you want more than a toy e-mail app, you will have a lot of work - properly encoding e-mail headers, for example, server authentication and secure authentication and transport layers, decoding of the e-mail text body itself for non-ASCII messages. Although the modules in the Python standard library do implement a lot of that, their documentation falls short on examples - and a complete e-mail client would have to use all of them.
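As a small illustration of the header-encoding part, the sketch below decodes an RFC 2047 encoded header value with the standard library. The wrapper name `decode_mime_header` is made up; `decode_header` and `make_header` are the stdlib functions doing the work.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Turn an RFC 2047 encoded header value into a readable str."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into a Header whose str() is decoded text.
    return str(make_header(decode_header(value)))
```

For example, `decode_mime_header("=?utf-8?q?Caf=C3=A9?=")` gives `"Café"`, while a plain ASCII value comes back unchanged.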
Certainly the place to start an e-mail client, even a toy one, would be taking a look at the most recent RFCs for e-mail (and you will have to pick them from here http://www.ietf.org/rfc/rfc-index since just looking for "email rfc" on Google gives a poor result).

I think you will find many of the client's important parts prepackaged:
Email retrieval - I think that is covered by many of the Python libraries.
Email sending - This would not be hard and it is most likely covered as well.
Email formatting - I know this is covered because I just used it to parse single and multipart emails for a client.
Email rendering - I would shoot for an HTML renderer of some sort. There is a Python interface to the renderer from the Mozilla project. I would guess there are other rendering engines that have python interfaces as well. I know wxWidgets has some simple HTML facilities and would be a lot lighter weight. Come to think about it the Mozilla engine may have a bunch of the other functions you would need as well. You would have to research each of the parts.
There is a lot more to it than what is listed above. Like anything worthwhile, it won't be built in a day. I would lay out precisely what you want it to do. Then start putting together a prototype. Just build a simple framework that does basic things. Like only have it support the text part of a message with no HTML. Then build on that.
I am amazed at the wealth of coding modules available with Python. I needed to filter html email messages, parse stylesheets, embed styles, and whole host of other things. I found just about every function I needed in a Python library somewhere. I was especially happy when I found out that some css sheets are gzipped that there was a module for that!
So if you are serious about it, then dig in. You will learn a LOT. :)

I have made two libraries that solve some of those problems easily:
Sending emails: Red Mail (SMTP)
Receiving emails: Red Box (IMAP)
Here is a short example of both:
from redbox import EmailBox
from redmail import EmailSender

USERNAME = "me@example.com"
PASSWORD = "<PASSWORD>"

box = EmailBox(
    host="imap.example.com",
    port=993,
    username=USERNAME,
    password=PASSWORD
)
sender = EmailSender(
    host="smtp.example.com",
    port=587,
    username=USERNAME,
    password=PASSWORD
)
Then you can send emails:
from pathlib import Path

# Use the EmailSender instance created above
sender.send(
    subject='email subject',
    sender="me@example.com",
    receivers=['you@example.com'],
    text="Hi, this is an email.",
    html="""
        <h1>Hi,</h1>
        <p>this is an email.</p>
    """,
    attachments={
        'data.csv': Path('path/to/file.csv'),
        'raw_file.html': '<h1>Just some HTML</h1>',
    }
)
Or read emails:
from redbox.query import UNSEEN, FROM

# Select an email folder
inbox = box["INBOX"]

# Search and process messages
for msg in inbox.search(UNSEEN & FROM('they@example.com')):
    # Process the message
    print(msg.headers)
    print(msg.from_)
    print(msg.to)
    print(msg.subject)
    print(msg.text_body)
    print(msg.html_body)
    # Set the message as read/seen
    msg.read()
Red Box's query language supports logical operations, so search conditions can be combined into complex queries. You can also easily access various parts of the messages.
Links, Red Mail:
Source code
Documentation
Links, Red Box:
Source code
Documentation

If I were you, I'd check out the source code of existing email-clients to get an idea: thunderbird, sylpheed-claws, mutt...
Depending on the set of features you want to support, it is a big project.

It depends on what level you want to build the client to. You can quickly whip something up with libraries like smtplib for handling connections/data, and tk for a GUI. But again, it all depends on the level of finish you're after.
A quick basic tool for yourself: Easy. (With libraries)
Writing a full-featured email client: Hard.
Instead of using a library, you can also find an open source project you can contribute to. I'd recommend having a look at Mailpile

Email title and link from rss-feed and email them

I'm doing a bit of an experiment in Python. I'm making a script which checks an RSS feed for new items, and then sends the title and link of the items via email. I've got the script to work to a certain level: when it runs it will take the link+title of the newest item and email it, regardless of whether it emailed that item already or not. I'd need to add 2 things: a way to get multiple items at once (and email those, one by one), and a way to check whether they have been sent already. How would I do this? I'm using feedparser, this is what I've got so far:
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
link = d.entries[0].link
title = d.entries[0].title
And then a couple of lines which send an email with "link" and "title" in there. I know I'd need to use the Etag, but haven't been able to work out how, and how would I send the emails 1 by 1?

For the feed parsing part, you could consider following the advice given in the question How to detect changed and new items in an RSS feed?. Basically, you could hash the contents of each entry and use that as an id.
For instance, on the first run of your program it will calculate the hash of each entry, store that hash, and send these new entries by mail. On its next run, it will rehash each entry's content and compare those hashes with the ones found before (you should use some sort of database for this, or at least an in-memory dictionary/list while developing, holding the entries already parsed and sent). If your program finds hashes that were not generated on previous runs, it will assemble a new email and send it with the "new" entries.
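A minimal sketch of that bookkeeping, assuming each entry offers `.get()` for `title` and `link` (feedparser entries behave like dicts) and that a small JSON file is enough as the "database". All function names here are made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def entry_hash(entry):
    """Hash the fields that identify an entry (here: title and link)."""
    content = "{}\n{}".format(entry.get("title", ""), entry.get("link", ""))
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def load_seen(path):
    """Load the set of hashes stored by earlier runs, if any."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()


def save_seen(path, seen):
    """Persist the set of hashes for the next run."""
    Path(path).write_text(json.dumps(sorted(seen)))


def select_new_entries(entries, seen):
    """Return entries not seen before and record their hashes in `seen`."""
    new_entries = []
    for entry in entries:
        digest = entry_hash(entry)
        if digest not in seen:
            seen.add(digest)
            new_entries.append(entry)
    return new_entries
```

A run would call `load_seen`, pass `d.entries` to `select_new_entries`, send one email per returned entry in a loop, and finish with `save_seen`.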
As for your email assembling part, the question Sending HTML email in Python could help. Just make sure to send a text only and a html version.
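One way to build a message with both a text and an HTML version is the standard library's `EmailMessage`. This is a sketch only; the function name `build_item_email` and the layout of the bodies are assumptions.

```python
import html
from email.message import EmailMessage


def build_item_email(sender, to, title, link):
    """Build a multipart/alternative message for one feed item."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = title
    # Plain-text version first; clients that cannot show HTML use this one.
    msg.set_content("{}\n{}\n".format(title, link))
    # HTML version; escape the values so they cannot break the markup.
    msg.add_alternative(
        '<p><a href="{}">{}</a></p>'.format(
            html.escape(link, quote=True), html.escape(title)
        ),
        subtype="html",
    )
    return msg
```

The resulting object can be handed to `smtplib.SMTP.send_message()`, once per item.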

For the simplest method see the python smtplib documentation example. (I won't repeat the code here.) It's all you need for basic email sending.
For nicer/more complicated email content also look into python's email module, of course.
