Wrong encoding of email attachment

I have a python 2.7 script running on windows. It logs in gmail, checks for new e-mails and attachments:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
file_types = ["pdf", "doc", "docx"] # download attachments with these extentions
login = "login"
passw = "password"
imap_server = "imap.gmail.com"
smtp_server = "smtp.gmail.com"
smtp_port = 587
from smtplib import SMTP
from email.parser import HeaderParser
from email.MIMEText import MIMEText
import sys
import imaplib
import getpass
import email
import datetime
import os
import time
if __name__ == "__main__":
    try:
        while True:
            session = imaplib.IMAP4_SSL(imap_server)
            try:
                rv, data = session.login(login, passw)
                print "Logged in: ", rv
            except imaplib.IMAP4.error:
                print "Login failed!"
                sys.exit(1)
            rv, mailboxes = session.list()
            rv, data = session.select(foldr)
            rv, data = session.search(None, "(UNSEEN)")
            for num in data[ 0 ].split():
                rv, data = session.fetch(num, "(RFC822)")
                for rpart in data:
                    if isinstance(rpart, tuple):
                        msg = email.message_from_string(rpart[ 1 ])
                        to = email.utils.parseaddr(msg[ "From" ])[ 1 ]
                text = data[ 0 ][ 1 ]
                msg = email.message_from_string(text)
                got = []
                for part in msg.walk():
                    if part.get_content_maintype() == 'multipart':
                        continue
                    if part.get('Content-Disposition') is None:
                        continue
                    filename = part.get_filename()
                    print "file: ", filename
                    print "Extention: ", filename.split(".")[ -1 ]
                    if filename.split(".")[ -1 ] not in file_types:
                        continue
                    data = part.get_payload(decode = True)
                    if not data:
                        continue
                    date = datetime.datetime.now().strftime("%Y-%m-%d")
                    if not os.path.isdir("CONTENT"):
                        os.mkdir("CONTENT")
                    if not os.path.isdir("CONTENT/" + date):
                        os.mkdir("CONTENT/" + date)
                    ftime = datetime.datetime.now().strftime("%H-%M-%S")
                    new_file = "CONTENT/" + date + "/" + ftime + "_" + filename
                    f = open(new_file, 'wb')
                    print "Got new file %s from %s" % (new_file, to)
                    got.append(filename.encode("utf-8"))
                    f.write(data)
                    f.close()
            session.close()
            session.logout()
            time.sleep(60)
    except:
        print "TARFUN!"
And the problem is that the last print reads garbage:
=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?=
for example
so later checks don't work. On linux it works just fine.
So far I have tried decoding/encoding the filename to UTF-8, but it did nothing. Thanks in advance.

If you read the spec that defines the filename field, RFC 2183, section 2.3, it says:
Current [RFC 2045] grammar restricts parameter values (and hence
Content-Disposition filenames) to US-ASCII. We recognize the great
desirability of allowing arbitrary character sets in filenames, but
it is beyond the scope of this document to define the necessary
mechanisms. We expect that the basic [RFC 1521] 'value'
specification will someday be amended to allow use of non-US-ASCII
characters, at which time the same mechanism should be used in the
Content-Disposition filename parameter.
There are proposed RFCs to handle this. In particular, it's been suggested that filenames be handled as encoded-words, as defined by RFC 5987, RFC 2047, and RFC 2231. In brief this means either RFC 2047 format:
"=?" charset "?" encoding "?" encoded-text "?="
… or RFC 2231 format:
"=?" charset ["*" language] "?" encoding "?" encoded-text "?="
Some mail agents already use this functionality; others don't know what to do with it. The email package in Python 2.x does not apply this decoding for you when you call get_filename(): an RFC 2047 encoded-word comes back exactly as it appeared in the header. (It's possible that the later version in Python 3.x does, or that it may change in the future, but that won't help you if you want to stick with 2.x.) So, if you want to parse this, you have to do it yourself.
In your example, you've got a filename in RFC 2047 format, with charset UTF-8 (which is usable directly as a Python encoding name), encoding B, which means Base-64, and content 0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv. So, you have to base-64 decode that, then UTF-8-decode that, and you get u'часть 1 текст методички.do'.
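The two decoding steps described above can be sketched as follows. This is a minimal illustration written for Python 3 (the question uses 2.7); the regular expression is a simplified assumption about the encoded-word shape, not a full RFC 2047 parser, and it only handles the B (Base-64) encoding.

```python
import base64
import re

# The encoded-word from the question.
encoded_word = "=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="

# Split into charset, encoding and encoded-text (simplified pattern, an
# assumption of this sketch; an optional "*language" suffix is skipped).
match = re.fullmatch(r"=\?([^?*]+)(?:\*[^?]*)?\?([BbQq])\?([^?]*)\?=", encoded_word)
charset, encoding, payload = match.groups()

if encoding.upper() != "B":
    raise ValueError("this sketch only handles Base-64 encoded-words")

# Step 1: Base-64 decode to raw bytes. Step 2: decode the bytes as the charset.
filename = base64.b64decode(payload).decode(charset)
print(filename)  # часть 1 текст методички.do
```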
If you want to do this more generally, you're going to have to write code which tries to interpret each filename in RFC 2231 format if possible, in RFC 2047 format otherwise, and does the appropriate decoding steps. This code is too involved to write out in full in a StackOverflow answer, but the basic idea is pretty simple, as demonstrated above, so you should be able to write it yourself. You may also want to search PyPI for existing implementations.
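For the RFC 2047 half of that, the standard library's email.header module has decode_header and make_header, which can do the splitting and charset handling. The helper below is a hedged sketch for Python 3; the name decode_filename is hypothetical, and it assumes the value is either plain text or a well-formed encoded-word with a charset Python knows.

```python
from email.header import decode_header, make_header


def decode_filename(raw_filename):
    """Return a text filename, decoding RFC 2047 encoded-words if present.

    Hypothetical helper: plain ASCII names pass through unchanged, and a
    missing filename (None) stays None.
    """
    if raw_filename is None:
        return None
    # decode_header splits the value into (payload, charset) chunks;
    # make_header joins them back into a single text value.
    return str(make_header(decode_header(raw_filename)))


print(decode_filename("=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="))
print(decode_filename("report.pdf"))
```

In the question's loop this would be applied to the result of part.get_filename() before the extension check.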

Related

Encoding error: in MIME file data via AWS SES

I am trying to retrieve attachment data, such as the file format and file name, from MIME messages received via AWS SES. Unfortunately, the file name is sometimes encoded: for example, the file name is "3_amrishmishra_Entry Level Resume - 02.pdf", but in the MIME message it appears as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='. Is there any way to get the exact file name?
if email_message.is_multipart():
    message = ''
    if "apply" in receiver_email.split('#')[0].split('_')[0] and isinstance(int(receiver_email.split('#')[0].split('_')[1]), int):
        for part in email_message.walk():
            content_type = str(part.get_content_type()).lower()
            content_dispo = str(part.get('Content-Disposition')).lower()
            print(content_type, content_dispo)
            if 'text/plain' in content_type and "attachment" not in content_dispo:
                message = part.get_payload()
            if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
                filename = part.get_filename()
                # open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
                # s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)
                data = {
                    'base64_resume': part.get_payload(),
                    'filename': filename,
                }
                data_list.append(data)
        try:
            api_data = {
                'email_data': email_data,
                'resumes_data': data_list
            }
            print(len(data_list))
            response = requests.post(url, data=json.dumps(api_data),
                                     headers={'content-type': 'application/json'})
            print(response.status_code, response.content)
        except Exception as e:
            print("error %s" % e)

This syntax '=?UTF-8?Q?...?=' is a MIME encoded word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it was sent with this encoding.
The best way to handle it depends on which Python version you're using...
Python 3
Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:
# Python 3
from email import message_from_bytes, policy
raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
message = message_from_bytes(raw_message_bytes, policy=policy.default)
for attachment in message.iter_attachments():
    # (EmailMessage.iter_attachments is new in Python 3)
    print(attachment.get_filename())
    # amrishmishra_Entry Level Resume – 02.pdf
You must specifically request policy.default. If you don't, the parser will use a compat32 policy that replicates Python 2.7's buggy behavior—including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)
Python 2
If you're on Python 2, the best option is upgrading to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with a massive rewrite in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)
If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:
# Python 2 (also works in Python 3, but you shouldn't need it there)
from email.header import decode_header
filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
decode_header(filename)
# [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]
(decoded_string, charset) = decode_header(filename)[0]
decoded_string.decode(charset)
# u'amrishmishra_Entry Level Resume – 02.pdf'
But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll encounter.
The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)

How do I get unfolded email headers in Python3?

(Note: this question has nothing to do with encoding, as should be clear by reading it. Ignore the suggestion above.)
I'm learning Python and figured a nice tool to start out with would be something that would grab some emails over IMAP and display a given header. The following is basically my script:
#!/usr/bin/env python3
from imaplib import IMAP4_SSL
from netrc import netrc
from email import message_from_bytes
conn = IMAP4_SSL('imap.gmail.com')
auth = netrc().hosts['imap.gmail.com']
conn.login(auth[0], auth[2])
conn.select()
typ, data = conn.search(None, 'ALL')
i = 0
for num in reversed(data[0].split()):
    i += 1
    typ, data = conn.fetch(num, '(RFC822)')
    email = message_from_bytes(data[0][1])
    print("%i: %s" % (int(num), email.get('subject')))
    if i == 5:
        break
conn.close()
conn.logout()
The frustrating thing is that the header comes back folded; thus showing through
the underlying email string instead of the actual value inside of the header.
How can I get the correctly unfolded header value? I'd like
to stick with core python3 stuff but I'm open to external deps if I must.

Use Policy Objects to enable unfolding in the Python email package. In your script, you would have to add:
from email.policy import SMTPUTF8
to import the policy SMTPUTF8, and later use that when calling message_from_bytes:
email = message_from_bytes(data[0][1], policy=SMTPUTF8)
I tried your script with Python 3.9.5; in fact, all policies except compat32 (which is used when the policy parameter is absent) enabled unfolding.
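The difference is easy to see without a mail server. The sketch below parses a hand-written message (an assumption made up for this illustration, not taken from the question) under both the compat32 and default policies and compares the Subject value.

```python
from email import message_from_bytes
from email.policy import compat32, default

# A made-up message whose Subject header is folded across two lines.
raw = (
    b"Subject: This is a long subject line that was folded\r\n"
    b" across two physical lines\r\n"
    b"\r\n"
    b"Body\r\n"
)

# compat32 (the default when no policy is given) keeps the fold in the value.
folded = message_from_bytes(raw, policy=compat32)["subject"]

# Any of the newer policies (default, SMTP, SMTPUTF8, ...) unfolds it.
unfolded = message_from_bytes(raw, policy=default)["subject"]

print(repr(folded))
print(repr(unfolded))
```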

TL;DR: strip newlines
I'd love it if there were a simple answer to this, so if you have a better one feel free to add it. In the meantime, this somewhat hacky workaround works perfectly:
#!/usr/bin/env python3
from imaplib import IMAP4_SSL
from netrc import netrc
from email import message_from_bytes
import re
conn = IMAP4_SSL('imap.gmail.com')
auth = netrc().hosts['imap.gmail.com']
conn.login(auth[0], auth[2])
conn.select()
typ, data = conn.search(None, 'ALL')
i = 0
for num in reversed(data[0].split()):
    i += 1
    typ, data = conn.fetch(num, '(RFC822)')
    email = message_from_bytes(data[0][1])
    raw_header = email.get('subject')
    # strip the CR/LF characters left over from header folding
    header = re.sub('[\r\n]', '', raw_header)
    print("%i: %s" % (int(num), header))
    if i == 5:
        break
conn.close()
conn.logout()

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them. I then add the result to the body of an email using webbrowser.
One text file is English characters (ascii) and the other Japanese (UTF-8). The text will display fine if I write it to a text file. But if I use webbrowser to insert the text into an email body the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be. Thunderbird and Mail (MacOSX) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test#test.com"
cc_list = "test1#test.com, test2#test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e
# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" %(mail_list, subject, cc_list, email_en_jp), new=1) # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
edit: Forgot to add I am using Python 2.7.1

EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname.
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
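One thing that may explain it: the RFC forbids raw 8-bit characters in the URL, but the UTF-8 bytes can still be percent-encoded so that the URL itself is pure ASCII. The sketch below shows that with urllib.parse.quote in Python 3. The function name build_mailto and the example addresses are hypothetical, and whether a given mail client decodes the body as UTF-8 is an assumption to verify against that client.

```python
from urllib.parse import quote


def build_mailto(to, subject, cc, body):
    """Build a mailto: URL whose fields are percent-encoded UTF-8.

    Hypothetical helper for illustration; quote() encodes text as UTF-8
    by default, so the result contains only ASCII characters.
    """
    return "mailto:{}?subject={}&cc={}&body={}".format(
        quote(to, safe="@,"),     # keep address punctuation readable
        quote(subject, safe=""),  # encode everything else, including & and =
        quote(cc, safe="@,"),
        quote(body, safe=""),
    )


url = build_mailto(
    "test@example.com",
    "Email Subject",
    "test1@example.com, test2@example.com",
    "Hello. Today is 2014-05-09\nこんにちは。今日は 2014-05-09 です。",
)
print(url)
```

The resulting string can be passed to webbrowser.open() or to the subprocess call shown above in place of the unencoded URL.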
I leave my previous answer as is below, although it does not work.
I have run into the same issue with Python 2.7.x several times now, and each time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.

Get the Gmail attachment filename without downloading it

I'm trying to get all the messages from a Gmail account that may contain some large attachments (about 30MB). I just need the names, not the whole files. I found a piece of code to get a message and the attachment's name, but it downloads the file and then reads its name:
import imaplib, email
#log in and select the inbox
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('username', 'password')
mail.select('inbox')
#get uids of all messages
result, data = mail.uid('search', None, 'ALL')
uids = data[0].split()
#read the latest message
result, data = mail.uid('fetch', uids[-1], '(RFC822)')
m = email.message_from_string(data[0][1])
if m.get_content_maintype() == 'multipart': #multipart messages only
    for part in m.walk():
        #find the attachment part
        if part.get_content_maintype() == 'multipart': continue
        if part.get('Content-Disposition') is None: continue
        #save the attachment in the program directory
        filename = part.get_filename()
        fp = open(filename, 'wb')
        fp.write(part.get_payload(decode=True))
        fp.close()
        print '%s saved!' % filename
I have to do this once a minute, so I can't download hundreds of MB of data. I am new to web scripting, so could anyone help me? I don't actually need to use imaplib; any Python library will be OK for me.
Best regards

Rather than fetch RFC822, which is the full content, you could specify BODYSTRUCTURE.
The resulting data structure from imaplib is pretty confusing, but you should be able to find the filename, content-type and sizes of each part of the message without downloading the entire thing.
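As a rough sketch of that approach in Python 3: fetch only BODYSTRUCTURE and pull the quoted "FILENAME" values out of the response. The helper names and the sample response below are hypothetical (the sample is hand-written for illustration, not captured from Gmail), and the regex assumes the server returns file names as plain quoted strings, which will not hold for every message.

```python
import re

# Matches the disposition parameter pair: "FILENAME" "some name.ext"
FILENAME_RE = re.compile(r'"FILENAME" "([^"]*)"', re.IGNORECASE)


def filenames_from_bodystructure(response_item):
    """Extract attachment file names from one raw BODYSTRUCTURE response."""
    text = response_item.decode("ascii", errors="replace")
    return FILENAME_RE.findall(text)


def fetch_attachment_names(mail, uid):
    """mail: a logged-in imaplib.IMAP4_SSL with a mailbox already selected."""
    result, data = mail.uid("fetch", uid, "(BODYSTRUCTURE)")
    names = []
    for item in data:
        if isinstance(item, bytes):
            names.extend(filenames_from_bodystructure(item))
    return names


# Hand-written sample response: a text part plus a ~30MB PDF attachment.
sample = (
    b'1 (UID 42 BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL '
    b'"7BIT" 12 1 NIL NIL NIL)("APPLICATION" "PDF" ("NAME" "report 2014.pdf") '
    b'NIL NIL "BASE64" 31457280 NIL ("ATTACHMENT" ("FILENAME" "report 2014.pdf")) '
    b'NIL) "MIXED" ("BOUNDARY" "abc") NIL NIL))'
)
print(filenames_from_bodystructure(sample))
```

Only the structure description crosses the network here, so the size of the attachment itself does not matter.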

If you know something about the file name, you can use the X-GM-RAW gmail extensions for imap SEARCH command. These extensions let you use any gmail advanced search query to filter the messages. This way you can restrict the downloads to the matching messages, or exclude some messages you don't want.
mail.uid('search', None, 'X-GM-RAW',
         'has:attachment filename:pdf in:inbox -label:parsed')
The above searches for messages with PDF attachments in INBOX that are not labeled "parsed".
Some pro tips:
label the messages you have already parsed, so you don't need to fetch them again (the -label:parsed filter in the above example)
always use the uid version instead of the standard sequential ids (you are already doing this)
unfortunately MIME is messy: there are a lot of clients that do weird (or plain wrong) things. You could try to download and parse only the headers, but is it worth the trouble?
[edit]
If you label a message after parsing it, you can skip the messages you have parsed already. This should be reasonable enough to monitor your class mailbox.
Perhaps you live in a corner of the world where internet bandwidth is more expensive than programmer time; in this case, you can fetch only the headers and look for "Content-disposition" == "attachment; filename=somefilename.ext".

A FETCH of the RFC822 message data item is functionally equivalent to BODY[]. IMAP4 supports other message data items, listed in section 6.4.5 of RFC 3501.
Try requesting a different set of message data items to get just the information that you need. For example, you could try RFC822.HEADER or maybe BODY.PEEK[2.MIME] (the MIME specifier has to be prefixed with a part number).

Old question, but just wanted to share the solution to this I came up with today. Searches for all emails with attachments and outputs the uid, sender, subject, and a formatted list of attachments. Edited relevant code to show how to format BODYSTRUCTURE:
data = mailobj.uid('fetch', mail_uid, '(BODYSTRUCTURE)')[1]
struct = data[0].split()
list = [] #holds list of attachment filenames
for j, k in enumerate(struct):
    if k == '("FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-3] != '"':
            count += 1
            val += " " + struct[j + count]
        list.append(val[1:-3])
    elif k == '"FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-1] != '"':
            count += 1
            val += " " + struct[j + count]
        list.append(val[1:-1])
I've also published it on GitHub.
EDIT
The above solution is good, but the logic that extracts the attachment file name from the payload is not robust. It fails when the file name contains a space and its first word has only two characters, for example: "ad cde gh.png".
Try this:
import re # Somewhere at the top
result, data = mailobj.uid("fetch", mail_uid, "BODYSTRUCTURE")
# raw string, so the backslash in the pattern is not treated as a string escape
itr = re.finditer(r'("FILENAME" "([^\/:*?"<>|]+)")', data[0].decode("ascii"))
for match in itr:
    print(f"File name: {match.group(2)}")

Python newbie - Input strings, return a value to a web page

I've got a program I would like to use to input a password and one or multiple strings from a web page. The program takes the strings and outputs them to a time-datestamped text file, but only if the password matches the set MD5 hash.
The problems I'm having here are that
I don't know how to get this code on the web. I have a server, but is it as easy as throwing pytext.py onto my server?
I don't know how to write a form for the input to this script and how to get the HTML to work with this program. If possible, it would be nice to make it a multi-line input box... but it's not necessary.
I want to return a value to a web page to let the user know if the password authenticated successfully or failed.
dtest
import sys
import time
import getopt
import hashlib
h = hashlib.new('md5')
var = sys.argv[1]
print "Password: ", var
h.update(var)
print h.hexdigest()
trial = h.hexdigest()
check = "86fe2288ac154c500983a8b89dbcf288"
if trial == check:
    print "Password success"
    time_stamp = time.strftime('%Y-%m-%d_%H-%M-%S', (time.localtime(time.time())))
    strFile = "txt_" + str(time_stamp) + ".txt"
    print "File created: txt_" + str(time_stamp) + ".txt"
    #print 'The command line arguments are:'
    #for i in sys.argv:
    #print i
    text_file = open(strFile, "w")
    text_file.write(str(time_stamp) + "\n")
    for i in range(2, len(sys.argv)):
        text_file.write(sys.argv[i] + "\n")
        #print 'Debug to file:', sys.argv[i]
    text_file.close()
else:
    print "Password failure"

You'll need to read up on mod_python (if you're using Apache) and the Python CGI module.
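To give a feel for what the web side involves, here is a minimal sketch in Python 3. Note that it swaps in WSGI (the interface the standard library's wsgiref module serves) rather than mod_python or the CGI module named above. The names app, password_ok and FORM are hypothetical; the hash is the one from the question, and writing the submitted lines to the timestamped file is left out.

```python
import hashlib
import hmac
from urllib.parse import parse_qs

EXPECTED_MD5 = "86fe2288ac154c500983a8b89dbcf288"  # hash from the question

# A password field plus a multi-line input box, posted back to the same URL.
FORM = (
    "<form method='post'>"
    "Password: <input type='password' name='password'><br>"
    "<textarea name='lines' rows='5' cols='40'></textarea><br>"
    "<input type='submit'></form>"
)


def password_ok(password):
    """Compare the MD5 of the submitted password with the stored hash."""
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, EXPECTED_MD5)


def app(environ, start_response):
    """WSGI application: show the form, and report success or failure on POST."""
    message = ""
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        password = fields.get("password", [""])[0]
        message = "Password success" if password_ok(password) else "Password failure"
    body = ("<html><body><p>%s</p>%s</body></html>" % (message, FORM)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]

# To try it locally:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

The returned page carries the "Password success" or "Password failure" text, which covers the requirement of reporting the result back to the user.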

Take a look at django. It's an excellent web framework that can accomplish exactly what you are asking. It also has an authentication module that handles password hashing and logins for you.
