Parsing out body and tables from emails


I currently am getting the body/content of the emails in Python using the following:
import email
message = email.message_from_file(open(file))
messages = [part.get_payload() for part in message.walk() if part.get_content_type() == 'text/plain']
This seems to work well in most cases, but I noticed that sometimes there are HTML tables that don't get picked up. The missing content starts with
<html>
<style type='text/css">
Would it be enough to add or part.get_content_type() == 'text/css'?

If I had to guess, I would guess that you need to add 'text/html'.
However, you should be able to figure out what content-type is in the email by examining the content of that variable.
import email
message = email.message_from_file(open(file))
# Remove the content-type filter completely
messages = [(part.get_payload(), part.get_content_type()) for part in message.walk()]
# print the whole thing out so that you can see what content-types are in there.
print(messages)
This will help you see what content types are in there and you can then filter the ones that you need.
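A minimal sketch of that filtering step, assuming the tables turn out to live in a text/html part. The helper name collect_bodies and the sample message are made up for illustration; parsing with policy=default is an assumption that makes get_content() return decoded text instead of the raw payload.

```python
import email
from email.message import EmailMessage
from email.policy import default

def collect_bodies(message, wanted=("text/plain", "text/html")):
    """Return (content_type, decoded_text) pairs for the wanted leaf parts."""
    bodies = []
    for part in message.walk():
        # Skip multipart containers and attachments; keep inline text parts only.
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        if part.get_content_type() in wanted:
            bodies.append((part.get_content_type(), part.get_content()))
    return bodies

# Made-up sample standing in for a message read from disk.
sample = EmailMessage()
sample["Subject"] = "Report"
sample.set_content("Plain-text version")
sample.add_alternative("<html><table><tr><td>1</td></tr></table></html>", subtype="html")

parsed = email.message_from_bytes(sample.as_bytes(), policy=default)
bodies = collect_bodies(parsed)
for content_type, text in bodies:
    print(content_type, len(text))
```

Note that text/css is not a part type here: the style element sits inside the text/html part, so filtering on text/html picks it up together with the tables.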

Related

Transfer an email with Python

I have tried, without success, to resend emails with Python.
Once I've logged in to SMTP and IMAP with TLS, this is what I have written:
status, data = self._imapserver.fetch(id, "(RFC822)")
email_data = data[0][1]
# create a Message instance from the email data
message = email.message_from_string(email_data)
# replace headers (could do other processing here)
message.replace_header("From", 'blablabla#bliblibli.com')
message.replace_header("To", 'blobloblo#blublublu.com')
self._smtpserver.sendmail('blablabla#bliblibli.com', 'blobloblo#blublublu.com', message.as_string())
But the problem is that the variable data doesn't contain the information from the email, even though the ID is the one I need.
It tells me:
b'The specified message set is invalid.'
How can I transfer an email with Python?

Like the error message says, whatever you have in id is invalid. We don't know what you put there, so all we can tell you is what's already in the error message.
(Also, probably don't use id as a variable name, as you will shadow the built-in function with the same name.)
There are additional bugs further on in your code; you need to use message_from_bytes if you want to parse it, though there is really no need to replace the headers just to resend it.
status, data = self._imapserver.fetch(correct_id, "(RFC822)")
self._smtpserver.sendmail('blablabla#bliblibli.com', 'blobloblo#blublublu.com', data[0][1])
If you want to parse the message, you should perhaps add a policy argument; this selects the modern EmailMessage API, which became official in Python 3.6 (it was provisional before that).
from email.policy import default
...
message = email.message_from_bytes(data[0][1], policy=default)
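# Under policy=default, assigning to an existing From or To header raises
# ValueError, so replace the headers instead.
message.replace_header("From", "blablabla#bliblibli.com")
message.replace_header("To", "blobloblo#blublublu.com")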
self._smtpserver.send_message(message)
The send_message method of smtplib.SMTP takes the sender and the recipients from the message's headers. If the message could contain other recipient headers like Cc:, Bcc: etc., perhaps using the good old sendmail method would be better, as it ignores the message's headers entirely.
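As a rough sketch of the parsing route, runnable without a server: the helper name readdress, the addresses, and the raw sample are all made up for illustration. It deletes each header before setting it, because the default policy refuses a second From or To, and deleting also works when the header is missing.

```python
import email
from email.policy import default

def readdress(raw_bytes, sender, recipient):
    """Parse raw RFC822 bytes and point From/To at new addresses."""
    message = email.message_from_bytes(raw_bytes, policy=default)
    for name, value in (("From", sender), ("To", recipient)):
        # Unique headers cannot be assigned twice under the default policy,
        # so drop any existing value before setting the new one.
        del message[name]
        message[name] = value
    return message

# Made-up stand-in for data[0][1] as returned by an IMAP fetch.
raw = (
    b"From: old.sender@example.com\r\n"
    b"To: old.recipient@example.com\r\n"
    b"Subject: Hello\r\n"
    b"\r\n"
    b"Body text\r\n"
)
message = readdress(raw, "new.sender@example.org", "new.recipient@example.org")
print(message["From"], message["To"])
```

The returned object can then be handed to send_message, or serialized with as_bytes() and passed to sendmail together with explicit envelope addresses.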

I want to mark the gmail messages Seen by imaplib

I want to parse some Gmail emails with Python.
When a message has been read, I want to mark it as seen.
I added this code, but the message is not marked as seen:
#read or seen email
mail.store(i,'+FLAGS', '\\Seen')
Do you know how I can get the email marked as seen?
import imaplib,dateutil.parser
import email
###################### mail read code ###################
mail=imaplib.IMAP4_SSL('imap.gmail.com',993) #IMAP over SSL uses port 993 (POP3 over SSL uses 995)
mail.login('example#co.jp','12123')
mail.select('example.jp',readonly=True) #select the mailbox read-only
#search for UNSEEN mail
type,data=mail.search(None,'UNSEEN')
for i in data[0].split(): #loop over the message numbers
    ok,x=mail.fetch(i,'RFC822') #fetch the whole message
    ms=email.message_from_string(x[0][1].decode('iso-2022-jp')) #parse the message
    #get the sender
    ad=email.header.decode_header(ms.get('From'))
    ms_code=ad[0][1]
    if(ms_code!=None):
        address=ad[0][0].decode(ms_code)
        address+=ad[1][0].decode(ms_code)
    else:
        address=ad[0][0]
    #get the subject
    sb=email.header.decode_header(ms.get('Subject'))
    ms_code=sb[0][1]
    if(ms_code!=None):
        sbject=sb[0][0].decode(ms_code)
    else:
        ms_code=sb[1][1]
        sbject=sb[1][0].decode(ms_code)
    #get the body
    maintext=ms.get_payload()
    #mark the email as read
    mail.store(i,'+FLAGS', '\\Seen')
    print(sbject)
    print(address)
    print(maintext)

You may want to read the Gmail API reference on labels; very likely you need to remove the UNREAD label from the e-mail you want marked as read. For that you would have to use the REST API or the googleapiclient Python library.

If you open your mailbox read-only, you can't make changes to it, including storing flags:
mail.select('example.jp',readonly=True) #mailbox
Remove the read-only flag.
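A minimal sketch of the read-write variant, assuming conn is an already logged-in imaplib connection. The function name mark_unseen_as_seen is hypothetical, and the host and credentials in the usage comment are placeholders.

```python
import imaplib

def mark_unseen_as_seen(conn, mailbox="INBOX"):
    """Flag every UNSEEN message in mailbox as Seen; return the message numbers."""
    # readonly defaults to False, so the mailbox is opened read-write.
    status, _ = conn.select(mailbox)
    if status != "OK":
        raise RuntimeError("could not select %r" % mailbox)
    status, data = conn.search(None, "UNSEEN")
    numbers = data[0].split() if status == "OK" and data and data[0] else []
    for num in numbers:
        conn.store(num.decode(), "+FLAGS", "\\Seen")
    return numbers

# Usage against a real server (host and credentials are placeholders):
# conn = imaplib.IMAP4_SSL("imap.example.com")
# conn.login("user@example.com", "app-password")
# mark_unseen_as_seen(conn)
# conn.logout()
```

With a read-write mailbox, fetching RFC822 already sets the Seen flag on the server side, so the explicit store call mostly matters when messages are fetched with a PEEK variant.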

How to read in whole message not just snippet

msg = service.users().messages().get(userId='me', id=message['id']).execute()
print(msg['snippet'])
I am currently using the above code, which doesn't get the whole message. I have seen in the documentation that the Google API has raw and full format options, but the raw option doesn't print in a readable way and I cannot get the full option to work.
Thank you !

This is what worked for me:
import base64

# Get the full message (headers and payload)
msg_header = service.users().messages().get(
    userId=user_id,
    id=msg_id,
    format="full",
    metadataHeaders=None
).execute()
# Decode the base64url-encoded body from the payload
body = base64.urlsafe_b64decode(
    msg_header.get("payload").get("body").get("data").encode("ASCII")
).decode("utf-8")
The body comes in HTML so, in my case, I use BeautifulSoup to extract the information I need, like below:
from bs4 import BeautifulSoup as bs

info = {}
soup = bs(body, 'html.parser')
# Loop over the rows of the e-mail table
for row in soup.findAll('tr'):
    aux = row.findAll('td')
    info[aux[0].string] = aux[1].string
The information extraction will depend on the pattern of the message. In my case, all messages that I'm getting have the same pattern.
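One caveat with the payload.body.data lookup above: it is only filled in for single-part messages. For multipart messages the data sits in nested payload["parts"] entries. The sketch below walks such a structure; the function name find_body_data and the sample payload are made up for illustration, shaped like msg["payload"] from a format="full" response.

```python
import base64

def find_body_data(payload, mime_type="text/html"):
    """Depth-first search of a Gmail payload dict for the first part of mime_type."""
    if payload.get("mimeType") == mime_type:
        data = payload.get("body", {}).get("data")
        if data:
            # Restore any stripped base64 padding before decoding.
            padded = data + "=" * (-len(data) % 4)
            return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    for part in payload.get("parts", []):
        found = find_body_data(part, mime_type)
        if found is not None:
            return found
    return None

def encode(text):
    """Helper for building the made-up sample below."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")

html = "<table><tr><td>key</td><td>value</td></tr></table>"
sample_payload = {
    "mimeType": "multipart/alternative",
    "body": {"size": 0},
    "parts": [
        {"mimeType": "text/plain", "body": {"data": encode("key: value")}},
        {"mimeType": "text/html", "body": {"data": encode(html)}},
    ],
}
print(find_body_data(sample_payload))
```

The returned HTML string can be passed to BeautifulSoup in the same way as body above.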

imaplib: what factors decide maintype is "text" or "multipart"

I have been writing Python code for a small tool in which I am trying to fetch mails using the Python libraries imaplib and email.
The code is something like this:
import imaplib
import email
mail = imaplib.IMAP4_SSL('imap.server')
mail.login('userid#mail.com', 'password')
mail.select('INBOX')  # a mailbox must be selected before searching
result, data = mail.uid('search', None, "ALL")
latest_email_uid = data[0].split()[-1]
result, data = mail.uid('fetch', latest_email_uid, '(RFC822)')
raw_email = data[0][1]
email_message = email.message_from_bytes(raw_email)  # fetch returns bytes in Python 3
maintype = email_message.get_content_maintype()
I am executing the script from different host machines simultaneously.
The problem I am facing is that, while fetching the mail body for the same incoming email, maintype is evaluated as "text" on the first host machine, whereas on the other host machine it is evaluated as "multipart" during script execution.
I would like to know how these values are determined at runtime and, if I always want maintype to be "multipart", what standard layout I should follow when writing the email body.

From comments:
raw_email in both cases contains raw HTML with multiple values. Almost all of the HTML is the same, apart from a few differences. For maintype=multipart, Content-Type="multipart/alternative" and a boundary parameter is present. For maintype=text, Content-Type="text/html" and no boundary field is present.
Well, that answers the question. get_content_maintype returns the first part of the Content-Type, which is multipart for multipart/alternative and text for text/html.
multipart/alternative means there are multiple alternative versions of the email. Usually, that is HTML plus text. Emails are often sent that way because they can then be read by any client (the text part), but still contain HTML formatting, which is used in clients that support it.
Apparently one of the emails was sent with both html and text, whereas the other contains only html.
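To see the difference without a mail server, here is a small sketch that builds both layouts with the standard library's EmailMessage class. The variable names and message contents are made up for illustration.

```python
from email.message import EmailMessage

# HTML-only message: a single text/html part, no boundary.
html_only = EmailMessage()
html_only.set_content("<p>Hello</p>", subtype="html")

# Text plus HTML: set_content creates text/plain, and add_alternative
# converts the message to multipart/alternative.
both = EmailMessage()
both.set_content("Hello")
both.add_alternative("<p>Hello</p>", subtype="html")

for label, msg in (("html_only", html_only), ("both", both)):
    print(label, msg.get_content_type(), msg.get_content_maintype())

# get_body() finds the HTML part in either layout.
html_part = both.get_body(preferencelist=("html",))
```

So the maintype depends on how the sender composed the message. If the sending side always provides both a text and an HTML version, the receiving side should see multipart; code that has to cope with both layouts can rely on get_body() instead of the maintype.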

How can I extract only the email body with Python using IMAP?

I am relatively new to programming and to Python, but I think I have done OK so far. This is the code I have, and it works fine, except that it gets the entire message in MIME format. I only want the text body of unread emails, but I can't quite figure out how to strip out all of the formatting and header info. If I send a basic email using an SMTP Python script that I made, it works fine and only prints the body, but if I send the email using Outlook it prints a bunch of extra garbage. Any help is very much appreciated.
import imaplib

client = imaplib.IMAP4_SSL(PopServer)
client.login(USER, PASSWORD)
client.select('INBOX')
status, email_ids = client.search(None, '(UNSEEN SUBJECT "%s")' % PrintSubject)
print(email_ids)
# search returns bytes in Python 3; build a comma-separated message set
client.store(email_ids[0].replace(b' ', b',').decode(), '+FLAGS', '\\Seen')

def get_emails(email_ids):
    data = []
    for e_id in email_ids[0].split():
        _, response = client.fetch(e_id, '(UID BODY[TEXT])')
        data.append(response[0][1])
    return data

for body in get_emails(email_ids):
    print(body)

Sounds like you're looking for the email package:
The email package provides a standard parser that understands most email document structures, including MIME documents. You can pass the parser a string or a file object, and the parser will return to you the root Message instance of the object structure. For simple, non-MIME messages the payload of this root object will likely be a string containing the text of the message. For MIME messages, the root object will return True from its is_multipart() method, and the subparts can be accessed via the get_payload() and walk() methods.
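A minimal sketch of that approach, using the is_multipart(), walk() and get_payload() methods named in the quote. It assumes the whole message is fetched (for example with '(RFC822)') instead of BODY[TEXT], since the parser needs the headers to find the MIME boundaries. The function name extract_text_body and the raw sample are made up for illustration.

```python
import email

def extract_text_body(raw_bytes):
    """Return the first text/plain body of a raw RFC822 message, or None."""
    message = email.message_from_bytes(raw_bytes)
    # A non-MIME or single-part message is its own only part.
    parts = message.walk() if message.is_multipart() else [message]
    for part in parts:
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None

# Made-up multipart message of the kind a desktop client might send.
raw = (
    b"Subject: Print job\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Please print this.\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>Please print this.</p>\r\n"
    b"--XYZ--\r\n"
)
print(extract_text_body(raw))
```

In the question's loop, this would mean fetching '(RFC822)' for each id and passing response[0][1] to extract_text_body.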
