Python email quoted-printable encoding problem - python

I am extracting emails from Gmail using the following:
def getMsgs():
    try:
        conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    except:
        print 'Failed to connect'
        print 'Is your internet connection working?'
        sys.exit()
    try:
        conn.login(username, password)
    except:
        print 'Failed to login'
        print 'Is the username and password correct?'
        sys.exit()
    conn.select('Inbox')
    # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
    typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
    for num in data[0].split():
        typ, data = conn.fetch(num, '(RFC822)')
        msg = email.message_from_string(data[0][1])
        yield walkMsg(msg)
def walkMsg(msg):
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        return part.get_payload()
However, for some emails it is nearly impossible to extract dates with a regex, because encoding-related characters such as '=' land, seemingly at random, in the middle of various text fields. Here's an example where this happens in a date range I want to extract:
Name: KIRSTI Email:
kirsti@blah.blah Phone #: + 999
99995192 Total in party: 4 total, 0
children Arrival/Departure: Oct 9=
,
2010 - Oct 13, 2010 - Oct 13, 2010
Is there a way to remove these encoding characters?

You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):
from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()
# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()
# Or check for errors
print rootMessage.defects
# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)
With decode=True, Message.get_payload undoes the part's Content-Transfer-Encoding (for example quoted-printable, as in your question) and returns the decoded payload.
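A minimal Python 3 sketch of how the decode flag could be applied to the asker's walkMsg loop. The helper name first_plain_text and the sample message are made up for illustration, and the fallback to UTF-8 when a part declares no charset is an assumption, not something the email package prescribes.

```python
import email

def first_plain_text(msg):
    """Return the first text/plain part as decoded text, or None."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes quoted-printable/base64 and returns bytes
        payload = part.get_payload(decode=True)
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = part.get_content_charset() or "utf-8"
        return payload.decode(charset, errors="replace")
    return None

# Hypothetical message with a soft line break ("=" at the end of a line)
raw = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XX"',
    "",
    "--XX",
    'Content-Type: text/html; charset="utf-8"',
    "",
    "<p>ignored</p>",
    "--XX",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "Arrival/Departure: Oct 9=",
    ", 2010 - Oct 13, 2010",
    "--XX--",
    "",
])
text = first_plain_text(email.message_from_string(raw))
print(text)
```

Once the soft line break is gone, the date range sits on one line and a regex can match it directly.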

If you are using Python 3.6 or later, you can use the email.message.EmailMessage.get_content() method to decode the text automatically. This method supersedes get_payload(), though get_payload() is still available.
Say you have a string s containing this email message (based on the examples in the docs):
Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>
To: Penelope Pussycat <penelope@example.com>,
 Fabrette Pussycat <fabrette@example.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Salut!
Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.
[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718
--Pep=C3=A9
=20
Non-ASCII characters in the string have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.
Create an email object:
import email
from email import policy
msg = email.message_from_string(s, policy=policy.default)
Setting the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that doesn't have the get_content method. policy.default will eventually become the default policy, but as of Python 3.7 it's still policy.compat32.
The get_content() method handles decoding automatically:
print(msg.get_content())
Salut!
Cela ressemble à un excellent recipie[1] déjeuner.
[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718
--Pepé
If you have a multipart message, get_content() needs to be called on the individual parts, like this:
for part in msg.iter_parts():
    print(part.get_content())
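As an alternative to looping over parts by hand, EmailMessage.get_body() can pick the text part for you. The sketch below assumes a small made-up message modelled on the booking example from the question and asks for the plain-text body only.

```python
import email
from email import policy

# Hypothetical quoted-printable message with a soft line break
raw = (
    "Subject: Booking\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Arrival/Departure: Oct 9=\n"
    ", 2010 - Oct 13, 2010\n"
)
msg = email.message_from_string(raw, policy=policy.default)

# get_body() returns the best matching part, or None if there is none
body = msg.get_body(preferencelist=("plain",))
text = body.get_content() if body is not None else ""
print(text)
```

The same call also works on multipart messages, where it searches the parts for a text/plain candidate.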

That's known as quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html
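A short sketch of quopri.decodestring applied to a fragment like the one in the question. It works on bytes, and the result still has to be decoded with the message's charset; UTF-8 is assumed here.

```python
import quopri

# "=" at the end of a line is a soft line break and disappears on decoding
encoded = b"Arrival/Departure: Oct 9=\n, 2010 - Oct 13, 2010"
decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)

# "=XX" sequences are hex-encoded bytes (assumed to be UTF-8 here)
accented = quopri.decodestring(b"d=C3=A9jeuner").decode("utf-8")
print(accented)
```

This only handles the transfer encoding of a raw body; it does not parse headers or split multipart messages.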

Related

PyWin32 excluding one specific instance of tag on all emails read in from PST

I've been developing a Python tool to ingest and write all emails from a PST exported from Outlook to individual .html files. The issue is that when opening the PST in outlook and checking the source information for emails individually, it includes this specific line:
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
which IS NOT being included when importing the PST with Pywin32 and reading all the emails in the PST. To see what it looks like in a chunk -
From Outlook: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)">
What is exported from the tool: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 15 (filtered medium)">
The contents of the emails are otherwise ENTIRELY identical except for that one tag.
My code:
htmlEmails = 0
encryptedEmails = 0
totalEmails = 0
richPlainEmails = 0
filenameCount = 1
mycounter2 = 1
#Adjusting name of PST location to be readable
selectedPST = str(selectedPST.replace('/', '\\'))
print('\nRunning:' , selectedPST)
outlook.AddStore(selectedPST)
PSTFolderObj = find_pst_folder(outlook, selectedPST)
def find_pst_folder(OutlookObj, pst_filepath):
    for Store in OutlookObj.Stores:
        if Store.IsDataFileStore and Store.FilePath == pst_filepath:
            return Store.GetRootFolder()
    return None
def enumerate_folders(FolderObj):
    for ChildFolder in FolderObj.Folders:
        enumerate_folders(ChildFolder)
    iterate_messages(FolderObj)
def iterate_messages(FolderObj):
    global mycounter2
    global encryptedEmails
    global richPlainEmails
    global totalEmails
    global htmlEmails
    for item in FolderObj.Items:
        totalEmails += 1
        try:
            try:
                body_content = item.HTMLbody
                mysubject = item.Subject
                writeToFile(body_content, exportPath, mysubject)
                mycounter2 = mycounter2 + 1
                htmlEmails += 1
            except AttributeError:
                #print('Non HTML formatted email, passing')
                richPlainEmails += 1
                pass
        except Exception as e:
            encryptedEmails += 1
            pass
def writeToFile(messageHTML, path, mysubject):
    global mycounter2
    filename = '\htmloutput' + str(mycounter2) + '.html'
    #Check if email is rich or plain text first (only HTML emails are desired)
    if '<!-- Converted from text/plain format -->' in messageHTML or '<!-- Converted from text/rtf format -->' in messageHTML:
        raise AttributeError()
    else:
        file = open(path + filename, "x", encoding='utf-8')
        try:
            messageHTML = regex.sub('\r\n', '\n', messageHTML)
            file.write(messageHTML)
        #Handle any potential unexpected Unicode error
        except Exception as e:
            print('Exception: ' , e)
            try:
                #Prints email subject to more easily find the offending email
                print('Subject: ', mysubject)
                print(mycounter2)
                file.write(messageHTML)
            except Exception as e:
                print('Tried utf decode: ', e)
        file.close()
Because the emails are otherwise identical, I can only assume this is being done by the library. I'm wondering if there's a reason that meta tag is excluded, or if it's a bug in PyWin32?
After much exploration and discussion with people familiar with PyWin32 and having my code reviewed and tested, it seems Outlook is the bad actor here.
I discovered that Outlook was causing the exact same behavior when attaching an email to another email. That is, if I sent an email, I could check the source info and it contained the information. When I then attached it to another email, the file would have the tag stripped.
As a result, I switched to LibPFF ( https://pypi.org/project/libpff-python/ ) to circumvent this. It can read PSTs and parse the HTML of the emails within them.
The code for LibPFF looks like this: (just include the path to the PST in the path+pstname spot):
import pypff
pst_file = pypff.file()
pst_file.open(path+pstname)
root = pst_file.get_root_folder()
for folder in root.sub_folders:
    for sub in folder.sub_folders:
        for message in sub.sub_messages:
            body_content = message.get_html_body()
            print(str(body_content))
This is essentially a workaround, but it can provide the same results depending on the use case. As for why Outlook does this, I can only assume that the tag with the charset info is required for sending the email over their server, so when it's attached to another email, it's seen as useless and is stripped.

Facing problem to decode ?UTF-8?B?ZnVjayDwn5CO?=! type in subject. Using IMAP and Python

I need to get the real string instead of the encoded one. Some subjects come through as plain strings, but others are in this encoded format, and I don't know how to handle them.
How can I decode the string and print the decoded subject?
FROM_EMAIL = "my_id@gmail.com"
FROM_PWD = "my Password"
SMTP_SERVER = "imap.gmail.com"
SMTP_PORT = 993
l=['Developer','Architect','NEED','Internship','Urgent']
def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(None,True)
def readmail():
    mail = imaplib.IMAP4_SSL(SMTP_SERVER)
    mail.login(FROM_EMAIL,FROM_PWD)
    mail.select('inbox')
    type, data = mail.search(None, '(SINCE "20-May-2020" BEFORE "26-May-2020")')
    mail_ids = data[0]
    id_list = mail_ids.split()
    id_list=id_list[::-1]
    first_email_id = id_list[0]
    latest_email_id = id_list[-1]
    for byte_obj in id_list:
        typ, data = mail.fetch(byte_obj, '(RFC822)' )
        raw=email.message_from_bytes(data[0][1])
        msg=get_body(raw)
        s=''
        s=raw['SUBJECT']
        s1=raw['Date']
        print(s)
readmail()
output:
Winner announcement! Amazon Kindle Oasis.
[FREE WEBINAR] Natural Language Processing for Beginners
Godrej 24 | Get Rs. 2 Lakh Gold Voucher | 2 & 3 BHK at Rs. 83 Lakh*
=?UTF-8?B?TGFzdCBkYXkgdG8gc2F2ZSEgUG9wdWxhciBjb3Vyc2VzIGFzIGw=?=
=?UTF-8?B?b3cgYXMg4oK5NDU1?=
Panda just uploaded a video
Vernix Gamerz just uploaded a video
Most of your question has been answered here:
Find, decode and replace all base64 values in text file
For a better understanding of your example, here is some additional information:
Parts of your subject lines are Base64-encoded, using the RFC 2047 encoded-word syntax.
Take the following part of your string s=raw['SUBJECT'] as an example:
=?UTF-8?B?TGFzdCBkYXkgdG8gc2F2ZSEgUG9wdWxhciBjb3Vyc2VzIGFzIGw=?=
=?UTF-8?B?b3cgYXMg4oK5NDU1?=
Each encoded word is structured as follows:
First comes the prefix, which names the character set (UTF-8) and the encoding (B for Base64):
=?UTF-8?B?
Then comes the encoded string, including its trailing = padding:
TGFzdCBkYXkgdG8gc2F2ZSEgUG9wdWxhciBjb3Vyc2VzIGFzIGw=
Followed by the terminator:
?=
Decoding the string from Base64 and reading the resulting bytes as UTF-8 gives you the text:
Last day to save! Popular courses as l
You can verify this at https://www.base64decode.org/
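Rather than decoding each chunk by hand, the standard library can do it in one step. This is a minimal sketch using email.header.decode_header and make_header on the two-line subject from the question; the variable names are illustrative.

```python
from email.header import decode_header, make_header

# The folded Subject value from the question (two encoded words)
raw_subject = (
    "=?UTF-8?B?TGFzdCBkYXkgdG8gc2F2ZSEgUG9wdWxhciBjb3Vyc2VzIGFzIGw=?=\n"
    " =?UTF-8?B?b3cgYXMg4oK5NDU1?="
)

# decode_header() splits the value into (bytes, charset) chunks;
# make_header() joins them back into a single readable string
subject = str(make_header(decode_header(raw_subject)))
print(subject)
```

In the readmail() loop this would be applied to raw['SUBJECT']; subjects without encoded words pass through unchanged.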

Modify Subject, Body after message_from_string in Python

I am trying to modify an email2sms script for Smstools 3.
A sample incoming sms file:
$ cat /var/spool/sms/incoming/GSM1.AtEO8G
From: 950
From_TOA: D0 alphanumeric, unknown
From_SMSC: 421950900050
Sent: 17-09-13 17:41:17
Received: 17-09-13 17:48:21
Subject: GSM1
Modem: GSM1
IMSI: 231030011459971
Report: no
Alphabet: ISO
Length: 5
test1
The script is using the following code to format the message:
if (statuscode == 'RECEIVED'):
    smsfile = open(smsfilename)
    msg = email.message_from_string(smsfile.read())
    msg['Original-From'] = msg['From']
    msg['To'] = forwardto
The problem: I want to modify the Subject field in the code above. I tried msg['Subject'] = 'Example' (after msg['To']), but the Subject field is not overwritten; it is duplicated. Does anybody know how to modify it after calling email.message_from_string()?
You want to replace the message's Subject header:
msg.replace_header('Subject', 'Example Subject')
Assigning to an index always adds a new header, so only use it when the header doesn't exist yet.
msg['Subject'] = 'Example Subject' # add new subject header
print(msg.items())
>> [('From', '950'), ('From_TOA', 'D0 alphanumeric, unknown'),
('From_SMSC', '421950900050'), ('Sent', '17-09-13 17:41:17'),
('Received', '17-09-13 17:48:21'), ('Subject', 'GSM1'),
('Modem', 'GSM1'), ('IMSI', '231030011459971'),
('Report', 'no'), ('Alphabet', 'ISO'),
('Length', '5'), ('Original-From', '950'),
('Subject', 'Example Subject')]
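The two ways of ending up with exactly one Subject header can be sketched as follows, using a shortened, made-up version of the SMS file. replace_header raises KeyError when the header is missing, whereas deleting first and then assigning works either way.

```python
import email

# Shortened stand-in for the incoming SMS file
msg = email.message_from_string("From: 950\nSubject: GSM1\n\ntest1\n")

# Option 1: replace in place (raises KeyError if no Subject header exists)
msg.replace_header("Subject", "Example Subject")

# Option 2: delete every existing Subject header, then add a fresh one
del msg["Subject"]
msg["Subject"] = "Another Subject"

print(msg.get_all("Subject"))
```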

Python Retrieve Email Addresses

I want to create a Python program to resend an email using the "email" and "smtplib" packages.
import email
import smtplib
f = open('email_source.eml')
em = email.message_from_file(f)
email_from = em['FROM'] # '"Me" <me@xyz.com>'
email_to = em['TO'] # '"John, A" <john@abc.com>, "Peter, B" <peter@def.com>'
In the above case I have 2 recipients, and I want to resend to both of them with smtplib.
import smtplib
smtp = smtplib.SMTP('localhost', '25')
smtp.sendmail(email_from, email_to, em.as_string())
If I pass the string email_to to sendmail, the email is only sent to the first person. If I replace email_to with a list,
email_to_list = ['"John, A" <john@abc.com>', '"Peter, B" <peter@def.com>']
the email is sent to both people.
My question is: can I extract the recipients from the em['TO'] and em['CC'] strings into a list?
Thank you.
The problem is that smtp.sendmail requires a list of addresses, according to the documentation:
SMTP.sendmail(from_addr, to_addrs, msg, mail_options=[], rcpt_options=[])
Send mail. The required arguments are an RFC 822 from-address string, a list of RFC 822 to-address strings (a bare string will be treated as a list with 1 address) […]
From the email-package you get a string, which the smtp-package then interprets as only one address.
In simple words, you need to split your to-address-string into a list of addresses.
How do you do this? You could do it manually, but it's best to just rely on the library:
import email.utils
email_to_raw = '"John, A" <john@abc.com>, "Peter, B" <peter@def.com>'
# split into (Name, Addr) tuple
email_to_split = email.utils.getaddresses([email_to_raw])
# combine the tuples into addresses, but keep the list
email_to = [email.utils.formataddr(pair) for pair in email_to_split]
print(email_to) # ['"John, A" <john@abc.com>', '"Peter, B" <peter@def.com>']
After swearing a bit at the designer of the API, you wrap it up into a function:
import email.utils
def split_combined_addresses(addresses):
    parts = email.utils.getaddresses(addresses)
    return [email.utils.formataddr(name_addr) for name_addr in parts]
print(split_combined_addresses(email_to))
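To cover both the To and Cc headers from the question, one possible sketch collects them with get_all() and keeps only the bare addresses, which is all sendmail needs for the envelope. The sample message and the Cc address are made up for illustration.

```python
import email
from email.utils import getaddresses

# Hypothetical message with To and Cc recipients
raw = (
    'From: "Me" <me@xyz.com>\n'
    'To: "John, A" <john@abc.com>, "Peter, B" <peter@def.com>\n'
    "Cc: carol@ghi.com\n"
    "\n"
    "body\n"
)
em = email.message_from_string(raw)

# get_all() returns every occurrence of a header; default to [] if absent
headers = em.get_all("To", []) + em.get_all("Cc", [])

# getaddresses() yields (display name, address) pairs
recipients = [addr for _name, addr in getaddresses(headers)]
print(recipients)

# smtp.sendmail(em["From"], recipients, em.as_string())  # needs a live server
```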
This is how I do it.
I assume the addresses are separated by semicolons, but you can replace ; with , in the pattern.
Here is the sample data:
addresses='"Johnny Test" <johnny@test.com>; Jack <another@test.com>; "Scott Summers" <scotts@test.com>; noname@test.com'
Source code:
import re
def split_combined_addresses(addresses):
    # remove line breaks and tabs
    addresses = re.sub(r"[\n\r\t]", "", addresses)
    # addrs = re.findall(r'(.*?)\s<(.*?)>,?', addresses)  # comma separated
    addrs = re.findall(r'(.*?)\s<(.*?)>;?', addresses)  # semicolon separated
    # strip surrounding whitespace and double quotes from each name
    addrs_clean = [(name.replace('"', '').strip(), addr) for name, addr in addrs]
    # add addresses that appear without a display name
    known = {addr for _, addr in addrs_clean}
    for addr in re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", addresses):
        if addr not in known:
            addrs_clean.append(('', addr))
    return addrs_clean

Get python getaddresses() to decode encoded-word encoding

msg = \
"""To: =?ISO-8859-1?Q?Caren_K=F8lter?= <ck@example.dk>, bob@example.com
Cc: "James =?ISO-8859-1?Q?K=F8lter?=" <jk@example.dk>
Subject: hello
message body blah blah blah
"""
import email.parser, email.utils
import itertools
parser = email.parser.Parser()
parsed_message = parser.parsestr(msg)
address_fields = ('to', 'cc')
addresses = itertools.chain(*(parsed_message.get_all(field) for field in address_fields if parsed_message.has_key(field)))
address_list = set(email.utils.getaddresses(addresses))
print address_list
It seems that email.utils.getaddresses() doesn't automatically handle RFC 2047 encoded-words in address fields.
How can I get the expected result below?
actual result:
set([('', 'bob@example.com'), ('=?ISO-8859-1?Q?Caren_K=F8lter?=', 'ck@example.dk'), ('James =?ISO-8859-1?Q?K=F8lter?=', 'jk@example.dk')])
desired result:
set([('', 'bob@example.com'), (u'Caren_K\xf8lter', 'ck@example.dk'), (u'James \xf8lter', 'jk@example.dk')])
The function you want is email.header.decode_header, which returns a list of (decoded_string, charset) pairs. It's up to you to further decode them according to charset and join them back together again before passing them to email.utils.getaddresses or wherever.
You might think that this would be straightforward:
def decode_rfc2047_header(h):
    return ' '.join(s.decode(charset or 'ascii')
                    for s, charset in email.header.decode_header(h))
But since message headers typically come from untrusted sources, you have to handle (1) badly encoded data; and (2) bogus character set names. So you might do something like this:
def decode_safely(s, charset='ascii'):
    """Return s decoded according to charset, but do so safely."""
    try:
        return s.decode(charset or 'ascii', 'replace')
    except LookupError: # bogus charset
        return s.decode('ascii', 'replace')
def decode_rfc2047_header(h):
    return ' '.join(decode_safely(s, charset)
                    for s, charset in email.header.decode_header(h))
Yeah, the email package interface really isn't very helpful a lot of the time.
Here, you have to use email.header.decode_header manually on each address, and then, since that gives you a list of decoded tokens, you have to stitch them back together again manually:
for name, address in email.utils.getaddresses(addresses):
    name = u' '.join(
        unicode(b, e or 'ascii') for b, e in email.header.decode_header(name)
    )
    ...
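The snippets above are written for Python 2. A rough Python 3 equivalent might look like the sketch below; the helper name decode_rfc2047 is made up. In Python 3, decode_header returns bytes chunks for headers that contain encoded words and a plain str otherwise, and it keeps the whitespace of unencoded chunks, so the pieces are joined with an empty string rather than a space.

```python
from email.header import decode_header
from email.utils import getaddresses

def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words in value, tolerating bad input."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii", "replace")
            except LookupError:  # unknown charset name
                chunk = chunk.decode("ascii", "replace")
        parts.append(chunk)
    # Unencoded chunks keep their own whitespace, so join with ""
    return "".join(parts)

headers = [
    "=?ISO-8859-1?Q?Caren_K=F8lter?= <ck@example.dk>, bob@example.com",
    '"James =?ISO-8859-1?Q?K=F8lter?=" <jk@example.dk>',
]
# Split into (name, address) pairs first, then decode each display name
decoded = [(decode_rfc2047(name), addr) for name, addr in getaddresses(headers)]
print(decoded)
```

Decoding after getaddresses() rather than before avoids a decoded name that happens to contain a comma being split into two addresses.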
Thank you, Gareth Rees. Your answer was helpful in solving a problem case:
Input: 'application/octet-stream;\r\n\tname="=?utf-8?B?KFVTTXMpX0FSTE8uanBn?="'
The absence of whitespace around the encoded-word caused email.Header.decode_header to overlook it. I'm too new to this to know if I've only made things worse, but this kludge, along with joining with a '' instead of ' ', fixed it:
if not ' =?' in h:
    h = h.replace('=?', ' =?').replace('?=', '?= ')
Output: u'application/octet-stream; name="(USMs)_ARLO.jpg"'
