Python: parse email with special characters [duplicate]

I am displaying new email with IMAP, and everything looks fine, except for one message subject shows as:
=?utf-8?Q?Subject?=
How can I fix it?

In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:
import email.header
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]
Check out the docs for email.header for more details.

This is a MIME encoded-word. You can parse it with email.header:
import email.header
def decode_mime_words(s):
    return u''.join(
        word.decode(encoding or 'utf8') if isinstance(word, bytes) else word
        for word, encoding in email.header.decode_header(s))
print(decode_mime_words(u'=?utf-8?Q?Subject=c3=a4?=X=?utf-8?Q?=c3=bc?='))

The text is encoded as a MIME encoded-word. This is a mechanism defined in RFC 2047 for encoding headers that contain non-ASCII text such that the encoded output contains only ASCII characters.
In Python 3.3+, the parsing classes and functions in email.parser automatically decode "encoded words" in headers if their policy argument is set to policy.default:
>>> import email
>>> from email import policy
>>> msg = email.message_from_file(open('message.txt'), policy=policy.default)
>>> msg['from']
'Pepé Le Pew <pepe@example.com>'
The parsing classes and functions are:
email.parser.BytesParser
email.parser.Parser
email.message_from_bytes
email.message_from_binary_file
email.message_from_string
email.message_from_file
Confusingly, up to at least Python 3.10, the default policy for these parsing functions is not policy.default, but policy.compat32, which does not decode "encoded words".
>>> msg = email.message_from_file(open('message.txt'))
>>> msg['from']
'=?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>'
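The difference between the two policies can be checked without a file on disk. The following is a minimal sketch, assuming a made-up message held in memory (the raw bytes and the variable names are illustrative, not taken from a real mailbox):

```python
from email import message_from_bytes, policy

# Hypothetical raw message; only the From header matters here.
raw = (
    b'From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe@example.com>\r\n'
    b'Subject: test\r\n'
    b'\r\n'
    b'body\r\n'
)

# No policy argument: policy.compat32 is used and the header stays encoded.
legacy = message_from_bytes(raw)
legacy_from = legacy['from']

# policy.default: encoded-words are decoded when the header is read.
modern = message_from_bytes(raw, policy=policy.default)
modern_from = str(modern['from'])

print(legacy_from)
print(modern_from)
```

Passing the policy explicitly on every parse call keeps the behaviour independent of whatever default a given Python version uses.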

Try Imbox
imaplib is a very low-level library, and the results it returns are hard to work with.
Installation
pip install imbox
Usage
from imbox import Imbox
with Imbox('imap.gmail.com',
           username='username',
           password='password',
           ssl=True,
           ssl_context=None,
           starttls=False) as imbox:
    all_inbox_messages = imbox.messages()
    for uid, message in all_inbox_messages:
        message.subject

In Python 3, decoding this to an approximated string is as easy as:
from email.header import decode_header, make_header
decoded = str(make_header(decode_header("=?utf-8?Q?Subject?=")))
See the documentation of decode_header and make_header.
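As a quick check of that one-liner, here is a small sketch that wraps it in a helper and runs it on a few header values. The helper name decode_subject and the sample inputs are assumptions made for illustration:

```python
from email.header import decode_header, make_header

def decode_subject(raw_subject):
    """Decode RFC 2047 encoded-words in a header value to a str."""
    return str(make_header(decode_header(raw_subject)))

# A header without encoded-words passes through unchanged.
plain = decode_subject('Subject')
# A single encoded-word, as in the question.
single = decode_subject('=?utf-8?Q?Subject?=')
# An encoded-word followed by plain ASCII text.
mixed = decode_subject('=?utf-8?q?Pep=C3=A9?= Le Pew')

print(plain, single, mixed, sep='\n')
```

Going through make_header avoids handling the bytes-versus-str results of decode_header by hand.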

A high-level IMAP library may be useful here: imap_tools
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd', 'INBOX') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
Parsed email message attributes
Query builder for searching emails
Actions with emails: copy, delete, flag, move, seen
Actions with folders: list, set, get, create, exists, rename, delete, status
No dependencies

Related

Converting ASCII encoded characters inside string in Python

I am using the imaplib library to read emails from my mail server. The emails contain JSON-encoded messages which my program should interpret.
Mail code:
tmp, data = imap.search(None, "UNSEEN")
emails = []
for num in data[0].split():
    tmp, data = imap.fetch(num, "(BODY[TEXT])")
    # Only append the email body
    emails.append(str(data[0][1]))
The strings I get from imaplib however contain some special characters. I have figured out that the =xx looks like the ASCII encoded version of the 'special' characters. How could I convert a string containing such characters to a 'regular' Python string or am I perhaps missing an option in the imaplib code which is encoding the strings incorrectly?
An example string I get:
b'This is a message in Mime Format. If you see this, your mail reader does not support this format.\r\n\r\n--=_8e336d0902b13eaec4e7906847c21a6d\r\nContent-Type: text/plain; charset=UTF-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n=0A=0A=0A=0A =0A =0A =0A =0A =0A =0A JSON{"arrival":"03.03.21","departure":"07.03.21","email":"test=\r\n=2Etest@gmail.com","apartment":"app","ov=\r\nerride":0}JSON =0A =0A=0A\r\n--=_8e336d0902b13eaec4e7906847c21a6d\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n=0A=0A=0A=0A =0A <meta charset=3D"utf-8"=20=\r\n/>=0A <meta http-equiv=3D"Content-Type" content=3D"text/html charset=\r\n=3DUTF-8" />=0A =0A =0A =0A JSON{"arrival":"03.03.21","departure":"07.03.21","email":"test=\r\n=2Etest@gmail.com","apartment":"app","ov=\r\nerride":0}JSON =0A =0A=0A\r\n--=_8e336d0902b13eaec4e7906847c21a6d--\r\n'
I was initially just removing all '\n', '\r' and '=' but today I received this email/string and my code incorrectly interpreted "test=\r\n=2Etest@gmail.com" as "test2Etest@gmail.com" instead of "test.test@gmail.com".
You are dealing with the encoding scheme named "quoted-printable" (more details in RFC 2045, section 6.7).
You have at least two options:
You could use the Python module quopri.
You could parse your email with the parser of the Python email module (email.parser).
But if your goal is simply to get the email content, it would be easier to use the imap_tools or IMAPClient modules.
Some example code from their documentation:
imap_tools (https://pypi.org/project/imap-tools/):
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test@mail.com', 'pwd') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
# get list of email subjects from INBOX folder - equivalent verbose version
mailbox = MailBox('imap.mail.com')
mailbox.login('test@mail.com', 'pwd', initial_folder='INBOX')  # or use mailbox.folder.set instead of the third argument
subjects = [msg.subject for msg in mailbox.fetch(AND(all=True))]
mailbox.logout()
IMAPClient (https://imapclient.readthedocs.io/en/2.1.0/):
from imapclient import IMAPClient
server = IMAPClient('imap.mailserver.com', use_uid=True)
server.login('someuser', 'somepassword')
select_info = server.select_folder('INBOX')
print('%d messages in INBOX' % select_info[b'EXISTS'])
# 34 messages in INBOX
messages = server.search(['FROM', 'best-friend@domain.com'])
print("%d messages from our best friend" % len(messages))
# 5 messages from our best friend
for msgid, data in server.fetch(messages, ['ENVELOPE']).items():
    envelope = data[b'ENVELOPE']
Your message contains a hint about its encoding, namely:
Content-Transfer-Encoding: quoted-printable
which explains the = signs in your text. You can use the built-in quopri module to deal with it, as follows:
import quopri
message = b'test=\r\n=2Etest@gmail.com'
decoded = quopri.decodestring(message)
print(decoded)
Output:
b'test.test@gmail.com'
Note that quopri.decodestring returns bytes, so you have to call .decode with the right charset if you need text. If UTF-8 is used, that is:
decoded = quopri.decodestring(message).decode('utf-8')
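The other standard-library route, the email parser, undoes the quoted-printable encoding as part of reading the body. The following is a rough sketch, assuming the complete message (headers included) is available rather than only BODY[TEXT]. The raw bytes are a shortened, hypothetical single-part version of the mail from the question:

```python
import json
from email import message_from_bytes, policy

# Hypothetical single-part message with the same soft line breaks ("=\r\n")
# and the same =2E escape as the mail in the question.
raw = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: text/plain; charset=UTF-8\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'JSON{"email":"test=\r\n'
    b'=2Etest@gmail.com","ov=\r\n'
    b'erride":0}JSON\r\n'
)

msg = message_from_bytes(raw, policy=policy.default)
# get_content() reverses the transfer encoding and decodes the charset.
body = msg.get_content()
# The JSON document sits between the two literal "JSON" markers.
payload = body.split('JSON')[1]
data = json.loads(payload)
print(data)
```

For a multipart mail like the one in the question, the same get_content() call would be applied to the text/plain part instead of the top-level message.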

non-recursive walk of email message from mailbox message

I'm trying to work with email messages in Python 3.7 and struggling with what looks like compatibility issues. The docs mention email.message.Message having an iter_parts method that should allow me to do a non-recursive walk of message parts.
This doesn't exist on messages returned from mailbox messages and it's taken me a while to get it behaving. For example, I can generate a dummy message with:
from email.message import EmailMessage
msg = EmailMessage()
msg['Subject'] = 'msg 1'
msg.add_alternative("Plain text body", subtype='plain')
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype='html')
msg.add_attachment(b"Nothing to see here!", maintype='data', subtype='raw')
and then dump out the parts with:
def iter_parts(msg):
    ret = msg.get_content_type()
    if msg.is_multipart():
        parts = ', '.join(iter_parts(m) for m in msg.iter_parts())
        ret = f'{ret} [{parts}]'
    return ret
iter_parts(msg)
which gives me: multipart/mixed [multipart/alternative [text/plain, text/html], data/raw]
but if I save this to a mbox file and reload it:
import mailbox
mbox = mailbox.mbox('/tmp/test.eml')
mbox.add(msg)
iter_parts(mbox[0])
it tells me AttributeError: 'mboxMessage' object has no attribute 'iter_parts'
Initially I thought it might be related to https://stackoverflow.com/a/45804980/1358308 but setting factory=None doesn't seem to do much in Python 3.7.
Am posting my solution, but would like to know if there are better options!
After much poking and reading of source I found that I can instead do:
from email import policy
from email.parser import BytesParser
mbox = mailbox.mbox('/tmp/test.eml', factory=BytesParser(policy=policy.default).parse)
and then I get objects with an iter_parts method.
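Put together as a self-contained script, the approach looks roughly like this. It is a sketch under assumptions: the mbox is written to a temporary directory, and the file name test.mbox and the helper name describe are made up for the example:

```python
import mailbox
import os
import tempfile
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

def describe(msg):
    """Return the content type of msg, with its direct sub-parts in brackets."""
    ret = msg.get_content_type()
    if msg.is_multipart():
        parts = ', '.join(describe(m) for m in msg.iter_parts())
        ret = f'{ret} [{parts}]'
    return ret

msg = EmailMessage()
msg['Subject'] = 'msg 1'
msg.add_alternative('Plain text body', subtype='plain')
msg.add_alternative('<html><body><p>HTML body</p></body></html>', subtype='html')

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'test.mbox')  # hypothetical file name
    # The factory parses each stored message with policy.default,
    # so the result is an EmailMessage rather than an mboxMessage.
    mbox = mailbox.mbox(path, factory=BytesParser(policy=policy.default).parse)
    key = mbox.add(msg)
    reloaded = mbox[key]
    structure = describe(reloaded)
    mbox.close()

print(structure)
```

The trade-off is that the reloaded objects no longer carry the mbox-specific methods of mboxMessage, such as the flag accessors.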

Python: Attaching MIME encoded text file

After a bunch of fiddling, I finally hit upon the magical sequence to attach a text file to an email (many thanks to previous posts on this service).
I'm left wondering what the lines:
attachment.add_header('Content-Disposition'. . .)
--and--
e_msg = MIMEMultipart('alternative')
actually do.
Can someone unsilence the Mimes for me please (sorry couldn't resist)
import smtplib
from email import Encoders
from email.message import Message
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
smtp_server = "1.2.3.4"
smtp_login = "account"
smtp_password = "password"
server = smtplib.SMTP(smtp_server)
server.login(smtp_login,smtp_password)
f = file("filename.csv")
attachment = MIMEText(f.read())
attachment.add_header('Content-Disposition', 'attachment', filename="filename.csv")
e_msg = MIMEMultipart('alternative')
e_msg.attach(attachment)
e_msg['Subject'] = 'Domestic Toll Monitor'
e_msg['From'] = smtp_account
body = 'Some nifty text goes here'
content = MIMEText(body)
e_msg.attach(content)
server.sendmail(smtp_from, smtp_to, e_msg.as_string())
Basically, MIME is the specification that defines email structure. The multipart structure allows several message bodies and attachments to be sent within the same message. For example, an email might carry a plain-text version for backwards compatibility and a rich-text or HTML version for modern clients. Each attachment counts as a "part" and thus needs its own headers. In this case, you're adding a Content-Disposition header to the attachment part, which marks it as an attachment and suggests a file name. If you're really interested in what that means, you can read the specification. As for the "alternative" portion, you're setting the message to multipart and declaring how the attached parts relate to each other and how the client should handle them. There are several standard subtypes for different scenarios: "alternative" tells the client that the parts are different renderings of the same content, of which it should show the one it handles best, while "mixed" is for independent parts such as a body plus attachments. For the record, "mixed" is the better fit for your message. The nice thing about MIME is that, while it is complicated, it's thoroughly defined and it's very easy to look up the specification.
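To see the structure without sending anything, here is a sketch that uses the newer EmailMessage API from Python 3 instead of the MIMEText/MIMEMultipart classes in the question. The addresses and the CSV text are placeholders, and the SMTP step is left out on purpose:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Domestic Toll Monitor'
msg['From'] = 'sender@example.com'       # placeholder address
msg['To'] = 'recipient@example.com'      # placeholder address
msg.set_content('Some nifty text goes here')

# Placeholder for the contents of filename.csv.
csv_text = 'a,b\n1,2\n'
# add_attachment converts the message to multipart/mixed and sets
# Content-Disposition: attachment; filename="filename.csv" on the new part.
msg.add_attachment(csv_text, subtype='csv', filename='filename.csv')

part_types = [part.get_content_type() for part in msg.iter_parts()]
attachment = next(msg.iter_attachments())

print(msg.get_content_type())
print(part_types)
print(attachment['Content-Disposition'])
```

Printing msg.as_string() shows the boundary lines and per-part headers that the answer above describes.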

Encoding mail subject (SMTP) in Python with non-ASCII characters

I am using the Python module MimeWriter to construct a message and smtplib to send it. The constructed message is:
file msg.txt:
-----------------------
Content-Type: multipart/mixed;
from: me<me@abc.com>
to: me@abc.com
subject: 主題
Content-Type: text/plain;charset=utf-8
主題
I use the code below to send a mail:
import smtplib
s=smtplib.SMTP('smtp.abc.com')
toList = ['me@abc.com']
f=open('msg.txt') #above msg in msg.txt file
msg=f.read()
f.close()
s.sendmail('me@abc.com',toList,msg)
I get the mail body correctly, but the subject is wrong:
subject: some junk characters
主題 <- body is correct.
Any suggestions? Is there a way to specify the encoding to be used for the subject, as is done for the body? How can I get the subject displayed correctly?
From http://docs.python.org/library/email.header.html
from email.message import Message
from email.header import Header
msg = Message()
msg['Subject'] = Header('主題', 'utf-8')
print(msg.as_string())
Subject: =?utf-8?b?5Li76aGM?=
More simply:
from email.header import Header
print(Header('主題', 'utf-8').encode())
=?utf-8?b?5Li76aGM?=
To complement this, decoding can be done with:
from email.header import decode_header
a = decode_header("""=?utf-8?b?5Li76aGM?=""")[0]
print(a[0].decode(a[1]))
Reference:
Python - email header decoding UTF-8
The subject is transmitted as a header, and headers are required to be ASCII-only. To support other encodings in the subject, you need to wrap it as an encoded-word that names the encoding you want to use. In your case, I would suggest the prefix =?UTF-8?B?, which means UTF-8, Base64-encoded.
In other words, I believe your subject header should more or less look like this:
Subject: =?UTF-8?B?5Li76aGM?=
In PHP you could go about it like this:
// Convert subject to base64
$subject_base64 = base64_encode($subject);
fwrite($smtp, "Subject: =?UTF-8?B?{$subject_base64}?=\r\n");
In Python:
import base64
subject_base64 = base64.b64encode(subject.encode('utf-8')).decode('ascii')
subject_line = "Subject: =?UTF-8?B?%s?=" % subject_base64
In short, if you use the EmailMessage API, you should code like this:
from email.message import EmailMessage
from email.header import Header
msg = EmailMessage()
msg['Subject'] = Header('主題', 'utf-8').encode()
The answer from @Sérgio cannot be used with the EmailMessage API, because only a string object can be assigned to EmailMessage()["Subject"], not an email.header.Header object.
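As far as I can tell, with EmailMessage (which uses policy.default) the non-ASCII subject can also be assigned as a plain str, and the encoded-word is produced when the message is serialized. A small sketch to check this, with a round trip through the parser; the variable names are only illustrative:

```python
from email import message_from_string, policy
from email.message import EmailMessage

msg = EmailMessage()        # EmailMessage defaults to policy.default
msg['Subject'] = '主題'      # plain str, no Header object involved
msg.set_content('body')

# Serialized form, as it would be handed to an SMTP client.
wire = msg.as_string()
subject_line = next(
    line for line in wire.splitlines() if line.startswith('Subject:'))
print(subject_line)  # ASCII-only encoded-word form of the subject

# Parsing the serialized text again recovers the original subject.
roundtrip = message_from_string(wire, policy=policy.default)
decoded_subject = str(roundtrip['Subject'])
```

This keeps the encoding step out of the calling code, at the cost of relying on the policy to pick the encoded-word form.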

Python 3 smtplib send with unicode characters

I'm having a problem emailing unicode characters using smtplib in Python 3. This fails in 3.1.1, but works in 2.5.4:
import smtplib
from email.mime.text import MIMEText
sender = to = 'ABC@DEF.com'
server = 'smtp.DEF.com'
msg = MIMEText('€10')
msg['Subject'] = 'Hello'
msg['From'] = sender
msg['To'] = to
s = smtplib.SMTP(server)
s.sendmail(sender, [to], msg.as_string())
s.quit()
I tried an example from the docs (http://docs.python.org/3.1/library/email-examples.html, the "Send the contents of a directory as a MIME message" example), which also failed.
Any suggestions?
The key is in the docs:
class email.mime.text.MIMEText(_text, _subtype='plain', _charset='us-ascii')
A subclass of MIMENonMultipart, the MIMEText class is used to create MIME objects of major type text. _text is the string for the payload. _subtype is the minor type and defaults to plain. _charset is the character set of the text and is passed as a parameter to the MIMENonMultipart constructor; it defaults to us-ascii. No guessing or encoding is performed on the text data.
So what you need is clearly not msg = MIMEText('€10'), but rather:
msg = MIMEText('€10'.encode('utf-8'), _charset='utf-8')
While not all that clearly documented, sendmail needs a byte-string, not a Unicode one (that's what the SMTP protocol specifies); look to what msg.as_string() looks like for each of the two ways of building it -- given the "no guessing or encoding", your way still has that euro character in there (and no way for sendmail to turn it into a bytestring), mine doesn't (and utf-8 is clearly specified throughout).
The _charset parameter of MIMEText defaults to us-ascii according to the docs. Since € is not in the us-ascii set, it doesn't work.
The example in the docs that you tried clearly states:
For this example, assume that the text file contains only ASCII characters.
You could use the .get_charset method on your message to investigate the charset; incidentally, there is a .set_charset method as well.
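On current Python 3 releases (the answers above were written against 3.1), my understanding is that MIMEText takes the text as a str and encodes it itself when a charset is given. A minimal sketch to inspect the result without an SMTP server; the variable names are illustrative:

```python
from email.mime.text import MIMEText

# Pass the text as str and name the charset explicitly.
msg = MIMEText('€10', 'plain', 'utf-8')
msg['Subject'] = 'Hello'

# The serialized message is pure ASCII: the body is transfer-encoded.
wire = msg.as_string()
charset = msg.get_content_charset()
transfer_encoding = msg['Content-Transfer-Encoding']

# Decoding the payload gives the original text back.
body = msg.get_payload(decode=True).decode(charset)

print(charset, transfer_encoding)
```

Because the serialized form contains only ASCII, it can be passed to sendmail as a str without tripping over the euro sign.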
