Converting ASCII encoded characters inside string in Python

I am using the imaplib library to read emails from my mail server. The emails contain JSON-encoded messages which my program should interpret.
Mail code:
tmp, data = imap.search(None, "UNSEEN")
emails = []
for num in data[0].split():
    tmp, data = imap.fetch(num, "(BODY[TEXT])")
    # Only append the email body
    emails.append(str(data[0][1]))
The strings I get from imaplib however contain some special characters. I have figured out that the =xx looks like the ASCII encoded version of the 'special' characters. How could I convert a string containing such characters to a 'regular' Python string or am I perhaps missing an option in the imaplib code which is encoding the strings incorrectly?
An example string I get:
b'This is a message in Mime Format. If you see this, your mail reader does not support this format.\r\n\r\n--=_8e336d0902b13eaec4e7906847c21a6d\r\nContent-Type: text/plain; charset=UTF-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n=0A=0A=0A=0A =0A =0A =0A =0A =0A =0A JSON{"arrival":"03.03.21","departure":"07.03.21","email":"test=\r\n=2Etest#gmail.com","apartment":"app","ov=\r\nerride":0}JSON =0A =0A=0A\r\n--=_8e336d0902b13eaec4e7906847c21a6d\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n=0A=0A=0A=0A =0A <meta charset=3D"utf-8"=20=\r\n/>=0A <meta http-equiv=3D"Content-Type" content=3D"text/html charset=\r\n=3DUTF-8" />=0A =0A =0A =0A JSON{"arrival":"03.03.21","departure":"07.03.21","email":"test=\r\n=2Etest#gmail.com","apartment":"app","ov=\r\nerride":0}JSON =0A =0A=0A\r\n--=_8e336d0902b13eaec4e7906847c21a6d--\r\n'
I was initially just removing all '\n', '\r' and '=' characters, but today I received this email and my code incorrectly interpreted "test=\r\n=2Etest#gmail.com" as "test2Etest#gmail.com" instead of "test.test#gmail.com".

You are dealing with the encoding scheme named "quoted-printable" (more details in RFC 2045, section 6.7).
You have at least two options:
You could use the Python module quopri.
You could parse your email with the parser of the Python email module (email.parser).
But if your goal is simply to get at the email content, it would be easier to use the modules imap_tools or IMAPClient.
Some example code from their documentation:
imap_tools (https://pypi.org/project/imap-tools/):
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test#mail.com', 'pwd') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
# get list of email subjects from INBOX folder - equivalent verbose version
mailbox = MailBox('imap.mail.com')
mailbox.login('test#mail.com', 'pwd', initial_folder='INBOX')  # or use mailbox.folder.set instead of the 3rd arg
subjects = [msg.subject for msg in mailbox.fetch(AND(all=True))]
mailbox.logout()
IMAPClient (https://imapclient.readthedocs.io/en/2.1.0/):
from imapclient import IMAPClient
server = IMAPClient('imap.mailserver.com', use_uid=True)
server.login('someuser', 'somepassword')
select_info = server.select_folder('INBOX')
print('%d messages in INBOX' % select_info[b'EXISTS'])
#34 messages in INBOX
messages = server.search(['FROM', 'best-friend#domain.com'])
print("%d messages from our best friend" % len(messages))
#5 messages from our best friend
for msgid, data in server.fetch(messages, ['ENVELOPE']).items():
    envelope = data[b'ENVELOPE']
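The email.parser option mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the complete message (headers included) is fetched, for example with "(RFC822)" rather than "(BODY[TEXT])", so that the parser can see the MIME boundary. The sample message RAW, the helper name extract_json and the JSON{...}JSON marker regex are illustrative assumptions, not part of any library.

```python
import email
import json
import re
from email import policy

# Hypothetical raw message, shaped like the bytes a full-message fetch returns.
RAW = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b'\r\n'
    b'--XYZ\r\n'
    b'Content-Type: text/plain; charset=UTF-8\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'=0A JSON{"email":"test=\r\n'
    b'=2Etest@example.com","ov=\r\n'
    b'erride":0}JSON =0A\r\n'
    b'--XYZ--\r\n'
)


def extract_json(raw_bytes):
    """Return the JSON object embedded in the plain-text part, or None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=('plain',))
    if part is None:
        return None
    # get_content() undoes quoted-printable and decodes using the part's charset.
    text = part.get_content()
    # Assumes a flat JSON object wrapped in JSON...JSON markers, as in the question.
    match = re.search(r'JSON(\{.*?\})JSON', text, re.DOTALL)
    return json.loads(match.group(1)) if match else None


payload = extract_json(RAW)
print(payload)
```

Because the decoding is driven by the Content-Transfer-Encoding and charset headers of each part, the same code should also cope with base64-encoded bodies.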

Your message contains a hint about the encoding, namely:
Content-Transfer-Encoding: quoted-printable
which explains the = signs in your text. You can use the built-in quopri module to deal with it, in the following way:
import quopri
message = b'test=\r\n=2Etest#gmail.com'
decoded = quopri.decodestring(message)
print(decoded)
output:
b'test.test#gmail.com'
Note that quopri.decodestring returns bytes, so you have to call .decode with the right charset if you need text. If UTF-8 is used, it will be:
decoded = quopri.decodestring(message).decode('utf-8')

Related

python email module for Gmail / subject

I'm parsing emails in mbox format with the email module. The email arrived from Gmail.
The important part of the code is:
import sys
import email
email_content = sys.stdin.read()
email_obj = email.message_from_string(email_content)
subject = email_obj['subject']
For the subject I'm getting a slightly weird encoding. In the raw text it looks like:
Subject: =?UTF-8?B?MjAxOS4gw6FwcmlsaXMgMjUu?=
Can anybody tell me how it is encoded and how I can "extract" it?
Many thanks.
Python: 2.7.13

The subject has been encoded according to RFC 2047. This is because an email subject is a header field, and header fields must be ASCII.
To decode:
>>> from email.header import decode_header
>>> decode_header("Subject: =?UTF-8?B?MjAxOS4gw6FwcmlsaXMgMjUu?=")
[('Subject:', None), ('2019. \xc3\xa1prilis 25.', 'utf-8')]
The escaped bytes in the tuple decode as follows:
'2019. április 25.'
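To show how the encoded-word is put together, here is a minimal Python 3 sketch that takes it apart by hand. It assumes a single encoded-word of the form =?charset?B?payload?= (the B stands for Base64). For real messages, decode_header is the safer choice, since it also handles Q-encoding and several words in one header.

```python
import base64

encoded = '=?UTF-8?B?MjAxOS4gw6FwcmlsaXMgMjUu?='

# Strip the leading "=?" and the trailing "?=", then split the three fields.
charset, transfer_encoding, payload = encoded[2:-2].split('?')

# This sketch only covers Base64 ("B"); "Q" would need quoted-printable decoding.
if transfer_encoding.upper() != 'B':
    raise ValueError('only B-encoded words are handled in this sketch')

subject = base64.b64decode(payload).decode(charset)
print(subject)
```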

Python 3 email body encoding

I am working on setting up a script that forwards incoming mail to a list of recipients.
Here's what I have now:
I read the email from stdin (that's how postfix passes it):
email_in = sys.stdin.read()
incoming = Parser().parse(email_in)
sender = incoming['from']
this_address = incoming['to']
I test for multipart:
if incoming.is_multipart():
    for payload in incoming.get_payload():
        # if payload.is_multipart(): ...
        body = payload.get_payload()
else:
    body = incoming.get_payload(decode=True)
I set up the outgoing message:
msg = MIMEMultipart()
msg['Subject'] = incoming['subject']
msg['From'] = this_address
msg['reply-to'] = sender
msg['To'] = "foo#bar.com"
msg.attach(MIMEText(body.encode('utf-8'), 'html', _charset='UTF-8'))
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
This works pretty well with ASCII characters (English text), forwards it and all.
When I send non-ASCII characters, though, it gives back gibberish (bytes or ASCII representations of the UTF-8 characters, depending on the email client).
What can be the problem? Is it on the incoming or the outgoing side?

The problem is that many email clients (including Gmail) send non-ASCII emails in base64. stdin, on the other hand, passes everything in as a string. If you parse that with Parser.parse(), the payload comes back as a string with the base64 still inside.
Instead, the optional decode argument of the get_payload() method should be used. When it is set, the method returns bytes. After that you can use the built-in decode() method to get a UTF-8 string, like so:
body = payload.get_payload(decode=True)
body = body.decode('utf-8')
There is great insight into utf-8 and python in Ned Batchelder's talk.
My final code works a bit differently; you can check that here, too.
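The decode=True approach can be sketched a little more generally as follows. This is a minimal sketch, not the poster's final code: the helper name text_bodies and the sample message RAW are hypothetical, and it assumes the raw message is available as bytes (for example from sys.stdin.buffer.read() in Python 3). It walks every part, undoes the transfer encoding, and uses the charset declared by each part rather than hard-coding UTF-8.

```python
import base64
import email


def text_bodies(raw_bytes):
    """Yield (content type, decoded text) for every text part of a message."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        # Assumption: fall back to UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or 'utf-8'
        yield part.get_content_type(), payload.decode(charset, errors='replace')


# Hypothetical single-part message whose body is base64-encoded UTF-8.
RAW = (
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'\r\n'
    + base64.b64encode('€10 árvíz'.encode('utf-8')) + b'\r\n'
)

bodies = list(text_bodies(RAW))
print(bodies)
```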

python - parse email - special characters [duplicate]

I am displaying new email with IMAP, and everything looks fine, except for one message subject shows as:
=?utf-8?Q?Subject?=
How can I fix it?

In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:
import email.header
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]
Check out the docs for email.header for more details.

This is a MIME encoded-word. You can parse it with email.header:
import email.header
def decode_mime_words(s):
    return u''.join(
        word.decode(encoding or 'utf8') if isinstance(word, bytes) else word
        for word, encoding in email.header.decode_header(s))
print(decode_mime_words(u'=?utf-8?Q?Subject=c3=a4?=X=?utf-8?Q?=c3=bc?='))

The text is encoded as a MIME encoded-word. This is a mechanism defined in RFC 2047 for encoding headers that contain non-ASCII text such that the encoded output contains only ASCII characters.
In Python 3.3+, the parsing classes and functions in email.parser automatically decode "encoded words" in headers if their policy argument is set to policy.default:
>>> import email
>>> from email import policy
>>> msg = email.message_from_file(open('message.txt'), policy=policy.default)
>>> msg['from']
'Pepé Le Pew <pepe#example.com>'
The parsing classes and functions are:
email.parser.BytesParser
email.parser.Parser
email.message_from_bytes
email.message_from_binary_file
email.message_from_string
email.message_from_file
Confusingly, up to at least Python 3.10, the default policy for these parsing functions is not policy.default, but policy.compat32, which does not decode "encoded words".
>>> msg = email.message_from_file(open('message.txt'))
>>> msg['from']
'=?utf-8?q?Pep=C3=A9?= Le Pew <pepe#example.com>'

Try Imbox
Because imaplib is a very low-level library and returns results that are hard to work with.
Installation
pip install imbox
Usage
from imbox import Imbox
with Imbox('imap.gmail.com',
           username='username',
           password='password',
           ssl=True,
           ssl_context=None,
           starttls=False) as imbox:
    all_inbox_messages = imbox.messages()
    for uid, message in all_inbox_messages:
        message.subject

In Python 3, decoding this to an approximated string is as easy as:
from email.header import decode_header, make_header
decoded = str(make_header(decode_header("=?utf-8?Q?Subject?=")))
See the documentation of decode_header and make_header.

A high-level IMAP library may be useful here: imap_tools
from imap_tools import MailBox, AND
# get list of email subjects from INBOX folder
with MailBox('imap.mail.com').login('test#mail.com', 'pwd', 'INBOX') as mailbox:
    subjects = [msg.subject for msg in mailbox.fetch()]
Parsed email message attributes
Query builder for searching emails
Actions with emails: copy, delete, flag, move, seen
Actions with folders: list, set, get, create, exists, rename, delete, status
No dependencies

Encoding mail subject (SMTP) in Python with non-ASCII characters

I am using the Python module MimeWriter to construct a message and smtplib to send it. The constructed message is:
file msg.txt:
-----------------------
Content-Type: multipart/mixed;
from: me<me#abc.com>
to: me#abc.com
subject: 主題
Content-Type: text/plain;charset=utf-8
主題
I use the code below to send a mail:
import smtplib
s=smtplib.SMTP('smtp.abc.com')
toList = ['me#abc.com']
f=open('msg.txt') #above msg in msg.txt file
msg=f.read()
f.close()
s.sendmail('me#abc.com',toList,msg)
I get the mail body correctly, but the subject is not right:
subject: some junk characters
主題 <- body is correct.
Is there any way to specify the encoding to be used for the subject as well,
as is specified for the body? How can I get the subject decoded correctly?

From http://docs.python.org/library/email.header.html
from email.message import Message
from email.header import Header
msg = Message()
msg['Subject'] = Header('主題', 'utf-8')
print(msg.as_string())
Subject: =?utf-8?b?5Li76aGM?=
More simply:
from email.header import Header
print(Header('主題', 'utf-8').encode())
=?utf-8?b?5Li76aGM?=
As a complement, decoding can be done with:
from email.header import decode_header
a = decode_header("""=?utf-8?b?5Li76aGM?=""")[0]
print(a[0].decode(a[1]))
Reference:
Python - email header decoding UTF-8

The subject is transmitted as an SMTP header, and headers are required to be ASCII-only. To support other encodings in the subject, you need to prefix the subject with whatever encoding you want to use. In your case, I would suggest prefixing the subject with =?UTF-8?B? and ending it with ?=, which means UTF-8, Base64-encoded.
In other words, I believe your subject header should more or less look like this:
Subject: =?UTF-8?B?JiMyMDAyNzsmIzM4OTg4Ow=?=
In PHP you could go about it like this:
// Convert subject to base64
$subject_base64 = base64_encode($subject);
fwrite($smtp, "Subject: =?UTF-8?B?{$subject_base64}?=\r\n");
In Python:
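import base64
# Python 3: base64.encodestring was removed in 3.9, so use b64encode.
# subject is a text string; encode it to UTF-8 bytes before Base64-encoding.
subject_base64 = base64.b64encode(subject.encode('utf-8')).decode('ascii')
subject_line = "Subject: =?UTF-8?B?%s?=" % subject_base64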

In short, if you use the EmailMessage API, you should code like this:
from email.message import EmailMessage
from email.header import Header
msg = EmailMessage()
msg['Subject'] = Header('主題', 'utf-8').encode()
The answer from @Sérgio cannot be used with the EmailMessage API, because only a string object can be assigned to EmailMessage()["Subject"], not an email.header.Header object.
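As a related sketch, in Python 3 the EmailMessage API also accepts the plain non-ASCII string and performs the RFC 2047 encoding itself when the message is serialized. The addresses below are placeholders, and nothing is sent here; the serialized bytes are parsed back only to check the round trip.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = '主題'          # plain str; encoded on serialization
msg['From'] = 'me@example.com'   # placeholder address
msg['To'] = 'me@example.com'     # placeholder address
msg.set_content('主題')

raw = msg.as_bytes()

# The header block is pure ASCII: the subject is now an encoded-word.
header_block = raw.split(b'\n\n', 1)[0]
print(header_block.decode('ascii'))

# Parsing with policy.default turns the encoded-word back into text.
parsed = message_from_bytes(raw, policy=policy.default)
print(parsed['subject'])
```

The resulting msg can be handed to smtplib's send_message(), which serializes it in the same way.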

Python 3 smtplib send with unicode characters

I'm having a problem emailing unicode characters using smtplib in Python 3. This fails in 3.1.1, but works in 2.5.4:
import smtplib
from email.mime.text import MIMEText
sender = to = 'ABC#DEF.com'
server = 'smtp.DEF.com'
msg = MIMEText('€10')
msg['Subject'] = 'Hello'
msg['From'] = sender
msg['To'] = to
s = smtplib.SMTP(server)
s.sendmail(sender, [to], msg.as_string())
s.quit()
I tried an example from the docs, which also failed. http://docs.python.org/3.1/library/email-examples.html, the Send the contents of a directory as a MIME message example
Any suggestions?

The key is in the docs:
class email.mime.text.MIMEText(_text, _subtype='plain', _charset='us-ascii')
A subclass of MIMENonMultipart, the MIMEText class is used to create MIME objects of major type text. _text is the string for the payload. _subtype is the minor type and defaults to plain. _charset is the character set of the text and is passed as a parameter to the MIMENonMultipart constructor; it defaults to us-ascii. No guessing or encoding is performed on the text data.
So what you need is clearly not msg = MIMEText('€10'), but rather:
msg = MIMEText('€10'.encode('utf-8'), _charset='utf-8')
While not all that clearly documented, sendmail needs a byte-string, not a Unicode one (that's what the SMTP protocol specifies); look to what msg.as_string() looks like for each of the two ways of building it -- given the "no guessing or encoding", your way still has that euro character in there (and no way for sendmail to turn it into a bytestring), mine doesn't (and utf-8 is clearly specified throughout).

The _charset parameter of MIMEText defaults to us-ascii according to the docs. Since € is not in the us-ascii set, it isn't working.
The example in the docs that you tried clearly states:
For this example, assume that the text file contains only ASCII characters.
You could use the .get_charset method on your message to investigate the charset; incidentally, there is a .set_charset method as well.
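To check the effect of the charset without sending anything, here is a minimal sketch, assuming a current Python 3 where MIMEText takes a str together with an explicit charset. It verifies that the serialized message contains only ASCII and that the payload decodes back to the original text.

```python
from email.mime.text import MIMEText

# Pass the text as str and name the charset explicitly.
msg = MIMEText('€10', 'plain', 'utf-8')
msg['Subject'] = 'Hello'

wire = msg.as_string()
# Raises UnicodeEncodeError if any non-ASCII character is left in the output.
wire.encode('ascii')

# With the utf-8 charset the body is transferred Base64-encoded.
print(msg['Content-Transfer-Encoding'])

# Decoding the payload gives the original text back.
roundtrip = msg.get_payload(decode=True).decode('utf-8')
print(roundtrip)
```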
