How to parse German Umlaute and other special characters from emails - python

I am trying to parse an email using Python's email module, with the Parser class provided by the email.parser submodule.
However, there are some special characters which I was not able to parse / convert correctly.
Here is the script I got so far:
import sys
import email
from email.parser import Parser
full_msg = Parser().parse(sys.stdin)
msg = full_msg # this ugly line is part of former debugging
sender = msg['from']
to = msg['to']
subject = msg['subject']
body = msg.get_payload()
date = msg['Date']
fname = '{}.txt'.format(date)
with open(fname, 'w') as f:
    f.write('{:10}{}\n'.format('Von:', sender))
    f.write('{:10}{}\n'.format('An:', to))
    f.write('{:10}{}\n'.format('Betreff:', subject))
    f.write('{}\n'.format(body))
Since I am parsing both international and German mails, I have to deal with the so-called 'Umlaute' (ä, ü, ö) and some other characters such as ß and the ellipsis (...).
So for example a body like
Würde Dürfte Könnte
becomes
W=C3=BCrde D=C3=BCrfte K=C3=B6nnte=
and a subject of
Das dürfte jetzt klappen
becomes
=?utf-8?Q?Das_d=C3=BCrfte_jetzt_klappen?=
Is there a way to deal with those encoding/decoding issues? What am I missing?
UPDATE 1:
The system locale was set to en_US.UTF-8. I changed it to de_DE.UTF-8 by reconfiguring the available locales. However, this did not change the output at all. locale gives:
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
UPDATE 2:
I found out that this type of encoding is called quoted-printable. There is a Python module called quopri that handles this format, but I was unable to get satisfying results. In the end I switched to JavaScript and MailParser, which works like a charm.
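For readers who want to stay in Python: the subject is an RFC 2047 encoded word and the body uses the quoted-printable transfer encoding, and the standard library can decode both. The following is only a minimal sketch, assuming Python 3. The sample message in raw is made up from the strings in the question, and falling back to UTF-8 when no charset is declared is an assumption, not something the email module prescribes.

```python
from email import message_from_string
from email.header import decode_header, make_header

# Hypothetical sample message built from the strings in the question.
raw = (
    "From: sender@example.com\n"
    "Subject: =?utf-8?Q?Das_d=C3=BCrfte_jetzt_klappen?=\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "W=C3=BCrde D=C3=BCrfte K=C3=B6nnte=\n"
)

msg = message_from_string(raw)

# Headers: decode the RFC 2047 encoded words into a normal str.
subject = str(make_header(decode_header(msg["subject"])))

# Body: decode=True undoes the Content-Transfer-Encoding and returns bytes,
# which still have to be decoded with the charset declared in the message.
payload = msg.get_payload(decode=True)
charset = msg.get_content_charset() or "utf-8"  # assumed fallback
body = payload.decode(charset, errors="replace")
```

For multipart messages get_payload(decode=True) returns None, so each part would have to be handled on its own, for example by iterating over msg.walk().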

Related

compose Thunderbird with specified reply-to in python

I've been composing emails in Python for Thunderbird, but I can't seem to set the reply-to field. I've tried the following, with a few variations.
I don't get any errors with this method. The message composes just fine, but the "reply-to" field is not filled in.
'''
def composeEmail():
    import subprocess
    tbirdPath = r'c:\Program Files (x86)\Mozilla Thunderbird\thunderbird.exe'
    to = 'sendto@somewhere.com'
    subject = 'This is my subject'
    mBody = 'This is the contents'
    replyTo = 'replies@somewhere.com'
    body = ('<html><body><h1></h1>%s<br></body></html>' % mBody)
    composeCommand = 'format=html,to={},reply-to={},subject={},body={}'.format(to, replyTo, subject, body)
    subprocess.Popen([tbirdPath, '-compose', composeCommand])
composeEmail()
'''
Unfortunately, this is not possible. You may want to file a bug here. Only these parameters are available at the moment:
https://hg.mozilla.org/comm-central/file/tip/mailnews/compose/src/nsMsgComposeService.cpp#l1356

Python fetching mailboxes with unicode characters

I'm trying to write a custom mail agent.
I am trying to fetch all mails, but some of my mailbox names contain Polish letters.
So this code (all print statements removed from the listing):
def parse_list_response(self, line):
    list_response_pattern = re.compile(r'\((?P<flags>.*?)\) "(?P<delimiter>.*)" (?P<name>.*)')
    line=line.decode(encoding='utf_8')
    flags, delimiter, mailbox_name = list_response_pattern.match(line).groups()
    mailbox_name = mailbox_name.strip('"')
    return (flags, delimiter, mailbox_name)
def fetch_mails(self, from_who, since_when):
    server = imaplib.IMAP4_SSL(self.hostname)
    server.login(self.owner, self.password)
    rc, mailboxes = server.list()
    for line in mailboxes:
        mailbox=self.parse_list_response(line)[2]
        server.select(mailbox)
        try:
            messages = server.search('FROM "{}"'.format(from_who))
gives me, for example, this mailbox:
decoded = (\Flagged \HasNoChildren) "/" "[Gmail]/Oznaczone gwiazdk&AQU-"
Note the &AQU- at the end: it stands for the Polish letter "ą".
The question is how to get rid of this. I cannot find out how to decode this sequence.
The encoding is IMAP4 Modified UTF-7, which is a convention used for international mailbox names, as defined in RFC3501, section 5.1.3.
Unfortunately, the imaplib module doesn't currently support it - although there are several issues on the python bug tracker that suggest that may change in the near future (e.g. issue 5305 and issue 22598).
Anyway, in the meantime, it looks like you will have to find a third-party package to handle this (e.g. imapclient).
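If pulling in a third-party package is not an option, the decoding itself is small enough to sketch by hand. The function below is a rough illustration of the rules in RFC 3501, section 5.1.3, assuming Python 3: "&" opens a base64 run that ends at "-", the run holds UTF-16BE text encoded with "," in place of "/" and without padding, and "&-" is a literal "&". The name decode_imap_utf7 is made up for this example, and the function does no validation beyond what base64 and the UTF-16 codec do on their own.

```python
import base64

def decode_imap_utf7(name):
    """Decode an IMAP modified UTF-7 mailbox name (hypothetical helper)."""
    result = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != "&":
            # Ordinary printable ASCII is passed through unchanged.
            result.append(ch)
            i += 1
            continue
        # Everything between "&" and the next "-" is one encoded run.
        end = name.index("-", i)  # raises ValueError if the run is unterminated
        chunk = name[i + 1:end]
        if not chunk:
            # "&-" encodes a literal ampersand.
            result.append("&")
        else:
            b64 = chunk.replace(",", "/")   # modified base64 uses "," for "/"
            b64 += "=" * (-len(b64) % 4)    # restore the stripped padding
            result.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(result)

mailbox = decode_imap_utf7("[Gmail]/Oznaczone gwiazdk&AQU-")
```

Here mailbox holds the readable name ending in "ą". Note that the encoded form is still what has to be sent back to the server, for example in select(), so it is worth keeping both versions around.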

Trailing equal signs (=) in emails

I download messages from a Gmail account using POP3 and save them in a SQLite database for further processing:
mailbox = poplib.POP3_SSL('pop.gmail.com', '995')
mailbox.user(user)
mailbox.pass_(password)
msgnum = mailbox.stat()[0]
for i in range(msgnum):
    msg = '\n'.join(mailbox.retr(i+1)[1])
    save_message(msg, dbmgr)
mailbox.quit()
However, looking in the database, all lines but the last one of the message body (payload) have trailing equal signs. Do you know why this happens?
Frederic's link led me to the answer. The encoding is called "quoted printable" (wiki) and it's possible to decode it using the quopri Python module (documentation):
msg.decode('quopri').decode('utf-8')
Update for python 3.x
You now have to invoke the codecs module.
import codecs
bytes_msg = bytes(msg, 'utf-8')
decoded_msg = codecs.decode(bytes_msg, 'quopri').decode('utf-8')
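As for why the equal signs appear in the first place: quoted-printable limits the length of encoded lines, and a "=" at the very end of a line is a soft line break, meaning the line continues on the next one. Only the last line of a paragraph has none, which matches what shows up in the database. A small sketch with quopri.decodestring, assuming Python 3; the sample bytes are invented for illustration:

```python
import quopri

# Hypothetical sample: one logical line that the sender wrapped into two.
# The "=" at the end of the first physical line is a soft line break.
encoded = b"W=C3=BCrde D=C3=BCrfte =\nK=C3=B6nnte"

# decodestring() removes the soft line break and turns =XX escapes into bytes;
# the resulting bytes still need to be decoded with the message's charset.
decoded = quopri.decodestring(encoded).decode("utf-8")
```

This works on the message body only. Decoding a whole raw message this way, headers included, is a shortcut that can misfire if other parts use a different transfer encoding such as base64.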

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them. I then insert the result into the body of an email using webbrowser.
One text file is English characters (ascii) and the other Japanese (UTF-8). The text will display fine if I write it to a text file. But if I use webbrowser to insert the text into an email body the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought that might be the issue, but it does not appear to be: both Thunderbird and Mail (Mac OS X) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test@test.com"
cc_list = "test1@test.com, test2@test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e
# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" %(mail_list, subject, cc_list, email_en_jp), new=1) # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
edit: Forgot to add I am using Python 2.7.1
EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname.
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
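One thing that may be worth trying, given that raw 8-bit characters are forbidden, is to percent-encode the values so that the URL itself is pure ASCII. The later mailto specification, RFC 6068, describes percent-encoded UTF-8 for this purpose, but whether a particular mail client decodes it correctly is something to test rather than assume. A minimal sketch, assuming Python 3 (the script in the question is Python 2.7); build_mailto is a made-up helper name:

```python
from urllib.parse import quote

def build_mailto(to, subject="", cc="", body=""):
    """Build an ASCII-only mailto URL (hypothetical helper, not a stdlib API)."""
    fields = []
    if cc:
        # Keep "@" and "," readable in address lists; encode everything else.
        fields.append("cc=" + quote(cc, safe="@,"))
    if subject:
        fields.append("subject=" + quote(subject, safe=""))
    if body:
        # quote() encodes str values as UTF-8 before percent-encoding them.
        fields.append("body=" + quote(body, safe=""))
    url = "mailto:" + quote(to, safe="@,")
    if fields:
        url += "?" + "&".join(fields)
    return url

url = build_mailto(
    "test@test.com",
    subject="Email Subject",
    cc="test1@test.com",
    body="こんにちは。",
)
```

The resulting url could then be handed to webbrowser.open() or to the subprocess call shown above in place of the hand-formatted string.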
I leave my previous answer as is below, although it does not work.
I have run into the same issue with Python 2.7.x several times now, and each time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.

Python char encoding

I have the following code :
msgtxt = "é"
msg = MIMEText(msgtxt)
msg.set_charset('ISO-8859-1')
msg['Subject'] = "subject"
msg['From'] = "from@mail.com"
msg['To'] = "to@mail.com"
serv.sendmail("from@mail.com","to@mail.com", msg.as_string())
The e-mail arrives with Ã© as its body instead of the expected é.
I have tried :
msgtxt = "é".encode("ISO-8859-1")
msgtxt = u"é"
msgtxt = unicode("é", "ISO-8859-1")
all yield the same result.
How to make this work?
Any help is appreciated.
Thanks in advance, J.
msgtxt = "é"
msg.set_charset('ISO-8859-1')
Well, what's the encoding of the source file containing this code? If it's UTF-8, which is a good default choice, just writing the é will have given you the two-byte string '\xc3\xa9', which, when viewed as ISO-8859-1, looks like Ã©.
If you want to use non-ASCII byte string literals in your source file without having to worry about what encoding the text editor is saving it as, use a string literal escape:
msgtxt = '\xE9'
# coding: utf-8 (or whatever you want to save your source file in)
msgtxt = u"é"
msg = MIMEText(msgtxt,_charset='ISO-8859-1')
Without the u the text will be in the source encoding. As a Unicode string, msgtxt will be encoded in the indicated character set.
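The same two points can be checked in a few lines. This sketch assumes Python 3, where every string literal is already a Unicode string, so it is not a drop-in replacement for the Python 2 code in the question. It first reproduces the mojibake and then lets MIMEText do the encoding:

```python
from email.mime.text import MIMEText

# What goes wrong: UTF-8 bytes for "é" read back as ISO-8859-1.
utf8_bytes = "é".encode("utf-8")            # b'\xc3\xa9'
mojibake = utf8_bytes.decode("iso-8859-1")  # two characters instead of one

# What works: hand MIMEText a Unicode string plus the target charset and
# let it encode the payload and set the Content-Type header itself.
msg = MIMEText("é", _charset="iso-8859-1")
declared = msg.get_content_charset()        # charset from the Content-Type header
body_bytes = msg.get_payload(decode=True)   # payload after undoing the transfer encoding
```

body_bytes is the single byte b'\xe9', which is é in ISO-8859-1 and agrees with the declared charset, so the receiving client has what it needs to display the character correctly.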
