Python 'Message' object has no attribute 'get_body'

I'm trying to search email body but facing some issues:
#!/usr/local/bin/python3
from email.message import EmailMessage
import email
import imaplib
import re
import sys
import logging
import base64
import os
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
###########log in to mailbox########################
user = 'email@company.com'
pwd = 'pwd'
conn = imaplib.IMAP4_SSL("outlook.office365.com")
conn.login(user,pwd)
conn.select("test")
count = conn.select("test")
resp, items = conn.uid("search", None, '(OR (FROM "some@email") (FROM "some@email"))')
items = items[0].split()
for emailid in items:
    resp, data = conn.uid("fetch", emailid, "(RFC822)")
    if resp == 'OK':
        email_body = data[0][1]  # .decode('utf-8')
        mail = email.message_from_bytes(email_body)
        # get all emails with words "PA1" or "PA2" in subject
        if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
            print(mail)
The problem is with the following line:
body = mail.get_body(preferencelist=('plain', 'html'))
which raises:
AttributeError: 'Message' object has no attribute 'get_body'

To address the message:
AttributeError: 'Message' object has no attribute 'get_body'
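When you create the message object you need to specify a policy; otherwise you get the default email.policy.compat32 policy, and the parser returns a legacy email.message.Message object. get_body() and several other methods belong to the newer EmailMessage class and are not available on the legacy class.
The line creating the mail object should be (with import email.policy at the top of the script, and email_body being the bytes from data[0][1]):
mail = email.message_from_bytes(email_body, policy=email.policy.default)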
More information at:
https://docs.python.org/3/library/email.policy.html
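The difference can be checked without an IMAP server. The following is a minimal sketch that parses the same bytes twice, once without a policy and once with email.policy.default. The sample message in raw is made up for illustration and stands in for data[0][1] from the fetch.

```python
import email
import email.policy

# Hypothetical message standing in for data[0][1] from the IMAP fetch.
raw = (
    b"Subject: PA1 report\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"MACHINE: p1prog07\r\n"
)

# Without a policy the parser uses compat32 and returns a legacy Message.
legacy = email.message_from_bytes(raw)
legacy_has_get_body = hasattr(legacy, "get_body")

# With a modern policy the parser returns an EmailMessage.
modern = email.message_from_bytes(raw, policy=email.policy.default)
body_part = modern.get_body(preferencelist=("plain", "html"))
body_text = body_part.get_content()

print(type(legacy).__name__, legacy_has_get_body)
print(type(modern).__name__, repr(body_text))
```

Note that get_body() returns a message part, not a string; get_content() is what yields the decoded text.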

You should not convert the MIME structure to a string and then feed that to message_from_string. Instead, keep it as a bytes object.
from email.policy import default as default_policy
...
items = items[0].split()
for emailid in items:
    resp, data = conn.uid("fetch", emailid, "(RFC822)")
    if resp == 'OK':
        email_blob = data[0][1]
        mail = email.message_from_bytes(email_blob, policy=default_policy)
        if not any(x in mail['subject'] for x in ('PA1', 'PA2')):
            continue
You are not showing how you traverse the MIME structure, so I assume you are currently not doing that at all. You probably want something like
        # continuation of the code above, still inside "if resp == 'OK':"
        body = mail.get_body(preferencelist=('plain', 'html'))
        text = body.get_content()  # decoded body as a str
        result = None
        for line in text.splitlines():
            if line.startswith('MACHINE:'):
                result = line[8:].strip()
                break
It looks like you have an email body part encoded using Content-Transfer-Encoding: quoted-printable. The above code is robust against various encodings because the email library decodes the encapsulation transparently for you, which gets rid of any QP-escaped line breaks, like the one in your question. For the record, quoted-printable can break up a long line anywhere, including in the middle of the value you are attempting to extract, so you really do want to decode before attempting to extract anything.
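To illustrate why decoding has to come first, here is a small sketch with an invented quoted-printable message in which a soft line break (the trailing "=") splits the value after MACHINE:. The message content and the STATUS line are assumptions made up for the example.

```python
import email
from email.policy import default as default_policy

# Hypothetical quoted-printable message; the "=" at the end of the first
# body line is a soft line break that splits the value "p1prog07".
raw = (
    b"Subject: PA1 alert\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"MACHINE: p1pr=\r\n"
    b"og07\r\n"
    b"STATUS: ok\r\n"
)

mail = email.message_from_bytes(raw, policy=default_policy)
# get_content() undoes the quoted-printable encoding, including soft breaks.
text = mail.get_body(preferencelist=("plain", "html")).get_content()

result = None
for line in text.splitlines():
    if line.startswith("MACHINE:"):
        result = line[len("MACHINE:"):].strip()
        break

print(result)
```

Searching the raw bytes instead would only find the fragment before the soft break.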

If it's acceptable for you to first remove all the line breaks =^M\n from the text, then it's quite simple:
import re
email_body = open("1.txt").read().replace("=^M\n", "")
matches = re.findall(r"(?<=MACHINE:)\s*(\w+)", email_body)
print(matches)
print(list(set(matches)))
Output:
['p1prog07', 'p2prog06', 'p2prog06', 'p1prog07', 'ldnv260']
['p2prog06', 'ldnv260', 'p1prog07']
The positive look-behind is a non-capturing group, so the only captured group in the regex is your desired string.

Related

python decode email from base64

Hello, I am using a Python script to fetch messages from a specific email address. Everything seems to work, but the printed result is a base64 string.
I want to decode the result so that the final print shows the decoded message. Please help!
Thanks in advance.
The code used:
# Importing libraries
import imaplib, email
user = 'USER_EMAIL_ADDRESS'
password = 'USER_PASSWORD'
imap_url = 'imap.gmail.com'
# Function to get email content part i.e its body part
def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(None, True)
# Function to search for a key value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data
# Function to get the list of emails under this label
def get_emails(result_bytes):
    msgs = [] # all the email data are pushed inside an array
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, 'BODY.PEEK[1]')
        msgs.append(data)
    return msgs
# this is done to make SSL connection with GMAIL
con = imaplib.IMAP4_SSL(imap_url)
# logging the user in
con.login(user, password)
# calling function to check for email under this label
con.select('Inbox')
# fetching emails from this user "tu**h*****1@gmail.com"
msgs = get_emails(search('FROM', 'MY_ANOTHER_GMAIL_ADDRESS', con))
# Uncomment this to see what actually comes as data
# print(msgs)
# Finding the required content from our msgs
# User can make custom changes in this part to
# fetch the required content he / she needs
# printing them by the order they are displayed in your gmail
for msg in msgs[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            # encoding set as utf-8
            content = str(sent[1], 'utf-8')
            data = str(content)
            # Handling errors related to UnicodeEncodeError
            try:
                indexstart = data.find("ltr")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")
                # printing the required content which we need
                # to extract from our email i.e our body
                print(data2[0: indexend])
            except UnicodeEncodeError as e:
                pass
THE RESULT PRINTED
'''
aGVsbG8gd29yZCBpYW0gdGhlIG1lc3NhZ2UgZnJvbSBnbWFpbA==
'''
You could just use the base64 module to decode base64 encoded strings:
import base64
your_string = "aGVsbG8gV29ybGQ="  # the base64-encoded string you need to decode
result = base64.b64decode(your_string.encode("utf8")).decode("utf8")
print(result)
Edit: encoding changed from ASCII to utf-8
If you need to decode every base64 encoded word in the message (they can appear in the Subject, From, and To headers, for example in display names), the code below might be useful. Here contentData is the entire email as a str:
import re, base64
# [^?]+ keeps each match inside a single encoded word, even with several on one line
encodedParts = re.findall(r'(=\?([^?]+)\?[Bb]\?([^?]+)\?=)', contentData)
for part in encodedParts:
    encodedPart = part[0]
    charset = part[1]
    encodedContent = part[2]
    contentData = contentData.replace(encodedPart, base64.b64decode(encodedContent).decode(charset))
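For headers specifically, the standard library can do this decoding itself. The sketch below is one possible alternative to the regex approach, not what the answers above use: it decodes an encoded word with email.header.decode_header and, separately, by parsing with email.policy.default. The header value is invented by wrapping the base64 text from the question in an encoded word.

```python
import email
from email.header import decode_header, make_header
from email.policy import default

# Hypothetical header value: the base64 text from the question wrapped
# as an RFC 2047 encoded word.
raw_subject = "=?UTF-8?B?aGVsbG8gd29yZCBpYW0gdGhlIG1lc3NhZ2UgZnJvbSBnbWFpbA==?="

# Option 1: decode a single header value explicitly.
decoded_subject = str(make_header(decode_header(raw_subject)))

# Option 2: parse with the default policy, which decodes headers on access.
msg = email.message_from_string(
    "Subject: " + raw_subject + "\n\nbody\n", policy=default
)
policy_subject = str(msg["subject"])

print(decoded_subject)
print(policy_subject)
```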

Access all fields in mbox using mailbox

I am attempting to perform some processing on email messages in mbox format.
After some searching and a bit of trial and error, I tried https://docs.python.org/3/library/mailbox.html#mbox
I have got this to do most of what I want (even though I had to write code to decode subjects) using the test code listed below.
I found this somewhat hit and miss, in particular the key needed to look up fields 'subject' seems to be trial and error, and I can't seem to find any way to list the candidates for a message. (I understand that the fields may differ from email to email.)
Can anyone help me to list the possible values?
I have another issue; the email may contain a number of "Received:" fields e.g.
Received: from awcp066.server-cpanel.com
Received: from mail116-213.us2.msgfocus.com ([185.187.116.213]:60917)
by awcp066.server-cpanel.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256)
I am interested in accessing the FIRST chronologically - I would be happy to search, but can't seem to find any way to access any but the first in the file.
#! /usr/bin/env python3
#import locale
#2020-08-31
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
import base64, quopri
def isbqencoded(s):
    """
    Test if Base64 or Quoted Printable strings
    """
    return s.upper().startswith('=?UTF-8?')
def bqdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable string to str
    """
    nd = s.find('?=', 10)
    if s.upper().startswith('=?UTF-8?B?'): # Base64
        bbb = base64.b64decode(s[10:nd])
    elif s.upper().startswith('=?UTF-8?Q?'): # Quoted Printable
        bbb = quopri.decodestring(s[10:nd])
    return bbb.decode("utf-8")
def sdecode(s):
    """
    Convert possibly multiline Base64 or Quoted Printable strings to str
    """
    outstr = ""
    if s is None:
        return outstr
    for ss in str(s).splitlines(): # split multiline strings
        sss = ss.strip()
        for sssp in sss.split(' '): # split multiple strings
            if isbqencoded(sssp):
                outstr += bqdecode(sssp)
            else:
                outstr += sssp
            outstr+=' '
    outstr = outstr.strip()
    return outstr
INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX)
print('Values = ', mymail.values())
print('Keys = ', mymail.keys())
# print(mymail.items)
# for message in mailbox.mbox(INBOX):
for message in mymail:
    # print(message)
    subject = message['subject']
    to = message['to']
    id = message['id']
    received = message['Received']
    sender = message['from']
    ddate = message['Delivery-date']
    envelope = message['Envelope-to']
    print(sdecode(subject))
    print('To ', to)
    print('Envelope ', envelope)
    print('Received ', received)
    print('Sender ', sender)
    print('Delivery-date ', ddate)
    # print('Received ', received[1])
Based on this answer I simplified the Subject decoding, and got similar results.
I am still looking for suggestions to access the remainder of the Header - in particular how to access multiple "Received:" fields.
#! /usr/bin/env python3
#import locale
#2020-09-02
"""
Extract Subject from MBOX file
"""
import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default
INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
for _, message in enumerate(mymail):
    print("date: :", message['date'])
    print("to: :", message['to'])
    print("from :", message['from'])
    print("subject:", message['subject'])
    print('Received: ', message['received'])
    print("**************************************")
The email message object provides a get_all method which returns all instances of a header, so we can use this to obtain all the values of the received header.
for header in message.get_all('received'):
    print('Received', header)
Each header is an instance of UnstructuredHeader. This isn't very helpful for identifying the earliest Received header, as the headers need to be parsed to extract the dates so that they can be sorted.
However, according to this answer, which quotes the RFC, received headers are always inserted at the beginning of the message. The docstring for EmailMessage.get_all() states:
Return a list of all the values for the named field.
These will be sorted in the order they appeared in the original
message, and may contain duplicates.
So the earliest received header should be the last header in the list returned by EmailMessage.get_all().
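As a rough sketch of that idea, the snippet below parses an invented message with two Received headers and picks the last entry returned by get_all(). The host names and timestamps are made up for illustration.

```python
import email
from email.policy import default

# Hypothetical message with two Received headers; the hosts are invented.
# The most recent hop is at the top, as mail servers prepend the header.
raw = (
    b"Received: from relay.example.net by mx.example.org; "
    b"Mon, 31 Aug 2020 10:00:05 +0000\r\n"
    b"Received: from sender.example.com by relay.example.net; "
    b"Mon, 31 Aug 2020 10:00:01 +0000\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"body\r\n"
)

message = email.message_from_bytes(raw, policy=default)

# get_all() returns None when the header is absent, so pass a default list.
received = message.get_all("received", [])
earliest = str(received[-1]) if received else None

print(len(received))
print(earliest)
```

This relies on every server having prepended its header as expected; it does not compare the dates themselves.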
Based on a Comment by snakecharmerb (now edited into the Question) I simplified the process.
In the end I did not need to decode received, because the Message-ID actually extracts the id from the original received field.
I list the code I finally used, in case this is of use to others.
This code just extracts header fields of interest and prints them, but the full code performs analysis on the messages.
#! /usr/bin/env python3
#import locale
#2020-09-05
"""
Extract Message Header details from MBOX file
"""
import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default
INBOX = '~/temp/Gmail'
print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
for _, message in enumerate(mymail):
    date = message['date']
    to = message['to']
    sender = message['from']
    subject = message['subject']
    messageID = message['Message-ID']
    received = message['received']
    deliveredTo = message['Delivered-To']
    if messageID is None:
        continue
    print("Date :", date)
    print("From :", sender)
    print("To: :", to)
    print('Delivered-To:', deliveredTo)
    print("Subject :", subject)
    print("Message-ID :", messageID)
    # print('Received :', received)
    print("**************************************")

Spaces replaced by =20 after extracting text from email

I tried to get the text of a received gmail, using the email and imaplib modules in python. After decoding with utf-8 and after getting the payload of the message, all the spaces are still replaced by =20. Can I use another decoding step in order to fix this?
The code is the following: (I got it from a youtube tutorial - https://youtu.be/Jt8LizzxkPU )
import email
import imaplib
username = "abc"
password = "123"
mail = imaplib.IMAP4_SSL("imap.gmail.com")
mail.login(username,password)
mail.select("inbox")
result, data = mail.uid("search", None,"ALL")
inbox_item_list = data[0].split()
for item in inbox_item_list:
    #most_recent = inbox_item_list[-1]
    #oldest = inbox_item_list[0]
    result2, email_data = mail.uid('fetch',item,'(RFC822)')
    raw_email = email_data[0][1].decode("utf-8")
    email_message = email.message_from_string(raw_email)
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    counter = 1
    for part in email_message.walk():
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        if not filename:
            ext = ".html"
            filename = "msg-part-%08d%s" %(counter, ext)
        counter += 1
        #save file
        content_type = part.get_content_type()
        print(subject_)
        print (content_type)
        if "plain" in content_type:
            print(part.get_payload())
        elif "html" in content_type:
            print("do some beautiful soup")
        else:
            print(content_type)
Try importing quopri; then, when you get the content of the email body (or any other text that contains =20), you can decode it with quopri.decodestring(). Note that it returns bytes.
I do it like this:
quopri.decodestring(part.get_payload())
Keep in mind that this only applies if you specifically want to decode quoted-printable. Normally I would say the answer from @jfs is neater.
Here's a complete code example of how a simple email (that contains both a literal =20 as well as =20 sequence that should be replaced by a space) could be decoded:
#!/usr/bin/env python3
import email.policy
email_text = """Subject: =?UTF-8?B?dGVzdCDwn5OnID0yMA==?=
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo=
oooooooooooooooooooooooooooooong=20word
=3D20
^ line starts with =3D20
emoji: <=F0=9F=93=A7>"""
msg = email.message_from_string(
    email_text, policy=email.policy.default
)
print("Subject: <{subject}>".format_map(msg))
assert not msg.is_multipart()
print(msg.get_content())
Output
Subject: <test 📧 =20>
loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong word
=20
^ line starts with =20
emoji: <📧>
msg.walk(), part.get_payload(decode=True) could be used to traverse more complex EmailMessage objects. See email Examples.
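For a multipart message, the same idea could look like the sketch below. The two-part message is invented for illustration; it compares the undecoded payload (which still contains =20) with what get_content() returns for each text part.

```python
import email
from email.policy import default

# Hypothetical multipart/alternative message with quoted-printable parts.
raw = (
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"hello=20world\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>hello=20world</p>\r\n"
    b"--BOUNDARY--\r\n"
)

msg = email.message_from_bytes(raw, policy=default)

raw_plain = None   # payload as transmitted, still quoted-printable
decoded = {}       # content type -> decoded text
for part in msg.walk():
    if part.is_multipart():
        continue  # containers have no text of their own
    if part.get_content_type() == "text/plain":
        raw_plain = part.get_payload()
    decoded[part.get_content_type()] = part.get_content()

print(repr(raw_plain))
print(decoded)
```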

IMAP message gets UnicodeDecodeError 'utf-8' codec can't decode

After 5 hours of trying, time to get some help. Sifted through all the stackoverflow questions related to this but couldn't find the answer.
The code is a gmail parser - works for most emails but some emails cause the UnicodeDecodeError. The problem is "raw_email.decode('utf-8')" but changing it (see comments) causes a different problem down below.
# Source: https://stackoverflow.com/questions/7314942/python-imaplib-to-get-gmail-inbox-subjects-titles-and-sender-name
import datetime
import time
import email
import imaplib
import mailbox
from vars import *
import re # to remove links from str
import string
EMAIL_ACCOUNT = 'gmail_login'
PASSWORD = 'gmail_psswd'
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(EMAIL_ACCOUNT, PASSWORD)
mail.list()
mail.select('inbox')
result, data = mail.uid('search', None, "ALL") # (ALL/UNSEEN)
id_list = data[0].split()
email_rev = reversed(id_list) # Returns a type list.reverseiterator, which is not list
email_list = list(email_rev)
i = len(email_list)
todays_date = time.strftime("%m/%d/%Y")
for x in range(i):
    latest_email_uid = email_list[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1] # Returns a byte
    raw_email_str = raw_email.decode('utf-8') # Returns a str
    #raw_email_str = base64.b64decode(raw_email_str1) # Tried this but didn't work.
    #raw_email_str = raw_email.decode('utf-8', errors='ignore') # Tried this but caused a TypeError down where var subject is created because something there is expecting a str or byte-like
    email_message = email.message_from_string(raw_email_str)
    date_tuple = email.utils.parsedate_tz(email_message['Date'])
    date_short = f'{date_tuple[1]}/{date_tuple[2]}/{date_tuple[0]}'
    # Header Details
    if date_short == '12/23/2019':
        #if date_tuple:
        #    local_date = datetime.datetime.fromtimestamp(email.utils.mktime_tz(date_tuple))
        #    local_message_date = "%s" %(str(local_date.strftime("%a, %d %b %Y %H:%M:%S")))
        email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
        subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
        #print(subject)
        if email_from.find('restaurants@uber.com') != -1:
            print('yay')
        # Body details
        if email_from.find('restaurants@uber.com') != -1 and subject.find('Payment Summary') != -1:
            for part in email_message.walk():
                if part.get_content_type() == "text/plain":
                    body = part.get_payload(decode=True)
                    body = body.decode("utf-8") # Convert byte to str
                    body = body.replace("\r\n", " ")
                    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', body) # removes url links
                    text2 = text.translate(str.maketrans('', '', string.punctuation))
                    body_list = re.sub("[^\w]", " ", text2).split()
                    print(body_list)
                    print(date_short)
                else:
                    continue
Here is an example how to retrieve and read mail parts with imapclient and the email.* modules from the python standard libs:
from imapclient import IMAPClient
import email
from email import policy
def walk_parts(part, level=0):
    print(' ' * 4 * level + part.get_content_type())
    # do something with part content (applies encoding by default)
    # part.get_content()
    if part.is_multipart():
        for subpart in part.get_payload():
            walk_parts(subpart, level + 1)
# context manager ensures the session is cleaned up
with IMAPClient(host="your_mail_host") as client:
    client.login('user', 'password')
    # select some folder
    client.select_folder('INBOX')
    # do something with folder, e.g. search & grab unseen mails
    messages = client.search('UNSEEN')
    for uid, message_data in client.fetch(messages, 'RFC822').items():
        email_message = email.message_from_bytes(
            message_data[b'RFC822'], policy=policy.default)
        print(uid, email_message.get('From'), email_message.get('Subject'))
    # alternatively search for specific mails
    msgs = client.search(['SUBJECT', 'some subject'])
    #
    # do something with a specific mail:
    #
    # fetch a single mail with UID 12345
    raw_mails = client.fetch([12345], 'RFC822')
    # parse the mail (very expensive for big mails with attachments!)
    mail = email.message_from_bytes(
        raw_mails[12345][b'RFC822'], policy=policy.default)
    # Now you have a python object representation of the mail and can dig
    # into it. Since a mail can be composed of several subparts we have
    # to walk the subparts.
    # walk all parts at once
    for part in mail.walk():
        # do something with that part
        print(part.get_content_type())
    # or recurse yourself into sub parts until you find the interesting part
    walk_parts(mail)
See the docs for email.message.EmailMessage. There you find all needed bits to read into a mail message.
Use 'ISO 8859-1' instead of 'utf-8'.
I had the same issue, and after a lot of research I realized that I simply needed to use the message_from_bytes function from email rather than message_from_string.
So for your code, simply replace:
raw_email_str = raw_email.decode('utf-8')
email_message = email.message_from_string(raw_email_str)
with
email_message = email.message_from_bytes(raw_email)
and it should work like a charm :)
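A small sketch of why this helps, assuming a message whose body is ISO-8859-1 encoded (the sample bytes are made up): decoding the whole message as UTF-8 fails, while message_from_bytes lets each part be decoded with its own declared charset.

```python
import email
from email.policy import default

# Hypothetical message whose body is ISO-8859-1 and sent as 8bit,
# so the raw bytes are not valid UTF-8.
raw_email = (
    b"Subject: Payment Summary\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9 total: 12\r\n"
)

# Decoding the whole message up front fails on the 0xE9 byte.
try:
    raw_email.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Parsing from bytes defers decoding to the part's declared charset.
email_message = email.message_from_bytes(raw_email, policy=default)
body = email_message.get_content()

print(utf8_ok)
print(repr(body))
```

The policy argument is optional for the fix itself; it is used here so that get_content() is available.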

Python search imap email for a string

New to python, having some trouble getting past this.
Am getting back emails from gmail via imap (with starter code from https://yuji.wordpress.com/2011/06/22/python-imaplib-imap-example-with-gmail/) and want to search a specific email (which I am able to fetch) for a specific string. Something like this
ids = data[0]
id_list = ids.split()
ids = data[0]
id_list = ids.split()
latest_email_id = id_list[-1]
result, data = mail.fetch(latest_email_id, "(RFC822)")
raw_email = data[0][1]
def search_raw():
    if 'gave' in raw_email:
        done = 'yes'
    else:
        done = 'no'
and it always sets done to no. Here's the output for the email (for the body section of the email)
Content-Type multipart/related;boundary=1_56D8EAE1_29AD7EA0;type="text/html"
--1_56D8EAE1_29AD7EA0
Content-Type text/html;charset="UTF-8"
Content-Transfer-Encoding base64
PEhUTUw+CiAgICAgICAgPEhFQUQ+CiAgICAgICAgICAgICAgICA8VElUTEU+PC9USVRMRT4KICAg
ICAgICA8L0hFQUQ+CiAgICAgICAgPEJPRFk+CiAgICAgICAgICAgICAgICA8UCBhbGlnbj0ibGVm
dCI+PEZPTlQgZmFjZT0iVmVyZGFuYSIgY29sb3I9IiNjYzAwMDAiIHNpemU9IjIiPlNlbnQgZnJv
bSBteSBtb2JpbGUuCiAgICAgICAgICAgICAgICA8QlI+X19fX19fX19fX19fX19fX19fX19fX19f
X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fXzwvRk9OVD48L1A+CgogICAgICAg
ICAgICAgICAgPFBSRT4KR2F2ZQoKPC9QUkU+CiAgICAgICAgPC9CT0RZPgo8L0hUTUw+Cg==
--1_56D8EAE1_29AD7EA0--
I know the issue is the html, but can't seem to figure out how to parse the email properly.
Thank you!
The text above is base64-encoded. Python has a module named base64 which gives you the ability to decode it.
import base64
import re
def has_gave(encoded_body):
    # encoded_body must be only the base64 text of the part, not the whole raw email
    email_body = base64.b64decode(encoded_body).decode('utf-8')
    match = re.search(r'.*gave.*', email_body, re.IGNORECASE)
    if match:
        done = 'yes'
        print('match found for word ', match.group())
    else:
        done = 'no'
        print('no match found')
    return done
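Instead of isolating the base64 text by hand, the email package can find and decode the text parts itself. The sketch below is one way to do that; the sample message is built with EmailMessage for illustration, and contains_word is a hypothetical helper name, not part of any library.

```python
import email
from email.message import EmailMessage
from email.policy import default

# Build a hypothetical base64-encoded HTML message to stand in for the
# raw bytes fetched over IMAP.
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content(
    "<html><body><pre>\nGave\n\n</pre></body></html>",
    subtype="html",
    cte="base64",
)
raw_email = sample.as_bytes()


def contains_word(raw_bytes, word):
    """Return True if any text part of the message contains word (case-insensitive)."""
    msg = email.message_from_bytes(raw_bytes, policy=default)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes base64 or quoted-printable and the charset
            if word.lower() in part.get_content().lower():
                return True
    return False


print(contains_word(raw_email, "gave"))
```

The lowercase comparison matters here because the body contains "Gave", which is also why the original 'gave' in raw_email test could never match.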
