Encoding error: in MIME file data via AWS SES - python

I am trying to retrieve attachments data like file format and name of file from MIME via aws SES. Unfortunately some time file name encoding is changed, like file name is "3_amrishmishra_Entry Level Resume - 02.pdf" and in MIME it appears as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?=', any way to get exact file name?
if email_message.is_multipart():
message = ''
if "apply" in receiver_email.split('#')[0].split('_')[0] and isinstance(int(receiver_email.split('#')[0].split('_')[1]), int):
for part in email_message.walk():
content_type = str(part.get_content_type()).lower()
content_dispo = str(part.get('Content-Disposition')).lower()
print(content_type, content_dispo)
if 'text/plain' in content_type and "attachment" not in content_dispo:
message = part.get_payload()
if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
filename = part.get_filename()
# open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
# s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)
data = {
'base64_resume': part.get_payload(),
'filename': filename,
}
data_list.append(data)
try:
api_data = {
'email_data': email_data,
'resumes_data': data_list
}
print(len(data_list))
response = requests.post(url, data=json.dumps(api_data),
headers={'content-type': 'application/json'})
print(response.status_code, response.content)
except Exception as e:
print("error %s" % e)

This syntax '=?UTF-8?Q?...?=' is a MIME encoded word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it was sent with this encoding.
The best way to handle it depends on which Python version you're using...
Python 3
Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:
# Python 3
from email import message_from_bytes, policy
raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
message = message_from_bytes(raw_message_bytes, policy=policy.default)
for attachment in message.iter_attachments():
# (EmailMessage.iter_attachments is new in Python 3)
print(attachment.get_filename())
# amrishmishra_Entry Level Resume – 02.pdf
You must specifically request policy.default. If you don't, the parser will use a compat32 policy that replicates Python 2.7's buggy behavior—including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)
Python 2
If you're on Python 2, the best option is upgrading to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with a massive rewrite in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)
If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:
# Python 2 (also works in Python 3, but you shouldn't need it there)
from email.header import decode_header
filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
decode_header(filename)
# [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]
(decoded_string, charset) = decode_header(filename)[0]
decoded_string.decode(charset)
# u'amrishmishra_Entry Level Resume – 02.pdf'
But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll encounter.
The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)

Related

S3 InvalidDigest when calling the PutObject operation [duplicate]

I have tried to upload an XML File to S3 using boto3. As recommended by Amazon, I would like to send a Base64 Encoded MD5-128 Bit Digest(Content-MD5) of the data.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.put
My Code:
with open(file, 'rb') as tempfile:
body = tempfile.read()
tempfile.close()
hash_object = hashlib.md5(body)
base64_md5 = base64.encodebytes(hash_object.digest())
response = s3.Object(self.bucket, self.key + file).put(
Body=body.decode(self.encoding),
ACL='private',
Metadata=metadata,
ContentType=self.content_type,
ContentEncoding=self.encoding,
ContentMD5=str(base64_md5)
)
When i try this the str(base64_md5) create a string like 'b'ZpL06Osuws3qFQJ8ktdBOw==\n''
In this case, I get this Error Message:
An error occurred (InvalidDigest) when calling the PutObject operation: The Content-MD5 you specified was invalid.
For Test purposes I copied only the Value without the 'b' in front: 'ZpL06Osuws3qFQJ8ktdBOw==\n'
Then i get this Error Message:
botocore.exceptions.HTTPClientError: An HTTP Client raised and unhandled exception: Invalid header value b'hvUe19qHj7rMbwOWVPEv6Q==\n'
Can anyone help me how to save Upload a File to S3?
Thanks,
Oliver
Starting with #Isaac Fife's example, stripping it down to identify what's required vs not, and to include imports and such to make it a full reproducible example:
(the only change you need to make is to use your own bucket name)
import base64
import hashlib
import boto3
contents = "hello world!"
md = hashlib.md5(contents.encode('utf-8')).digest()
contents_md5 = base64.b64encode(md).decode('utf-8')
boto3.client('s3').put_object(
Bucket="mybucket",
Key="test",
Body=contents,
ContentMD5=contents_md5
)
Learnings: first, the MD5 you are trying to generate will NOT look like what an 'upload' returns. We actually need a base64 version, it returns a md.hexdigest() version. hex is base16, which is not base64.
(Python 3.7)
Took me hours to figure this out because the only error you get is "The Content-MD5 you specified was invalid." Super useful for debugging... Anyway, here is the code I used to actually get the file to upload correctly before refactoring.
json_results = json_converter.convert_to_json(result)
json_results_utf8 = json_results.encode('utf-8')
content_md5 = md5.get_content_md5(json_results_utf8)
content_md5_string = content_md5.decode('utf-8')
metadata = {
"md5chksum": content_md5_string
}
s3 = boto3.resource('s3', config=Config(signature_version='s3v4'))
obj = s3.Object(bucket, 'filename.json')
obj.put(
Body=json_results_utf8,
ContentMD5=content_md5_string,
ServerSideEncryption='aws:kms',
Metadata=metadata,
SSEKMSKeyId=key_id)
and the hashing
def get_content_md5(data):
digest = hashlib.md5(data).digest()
return base64.b64encode(digest)
The hard part for me was figuring out what encoding you need at each step in the process and not being very familiar with how strings are stored in python at the time.
get_content_md5 takes a utf-8 bytes-like object only, and returns the same. But to pass the md5 hash to aws, it needs to be a string. You have to decode it before you give it to ContentMD5.
Pro-tip - Body on the other hand, needs to be given bytes or a seekable object. Make sure if you pass a seekable object that you seek(0) to the beginning of the file before you pass it to AWS or the MD5 will not match. For that reason, using bytes is less error prone, imo.

Python fetching mailboxes with unicode characters

I'm trying to write custom mail agent.
I am trying to fetch all mails, but my mailbox has polish letter in mailboxnames...
So this code (cut all prints from listing):
def parse_list_response(self, line):
list_response_pattern = re.compile(r'\((?P<flags>.*?)\) "(?P<delimiter>.*)" (?P<name>.*)')
line=line.decode(encoding='utf_8')
flags, delimiter, mailbox_name = list_response_pattern.match(line).groups()
mailbox_name = mailbox_name.strip('"')
return (flags, delimiter, mailbox_name)
def fetch_mails(self, from_who, since_when):
server = imaplib.IMAP4_SSL(self.hostname)
server.login(self.owner, self.password)
rc, mailboxes = server.list()
for line in mailboxes:
mailbox=self.parse_list_response(line)[2]
server.select(mailbox)
try:
messages = server.search('FROM "{}"'.format(from_who))
Gives me for example mailbox:
decoded = (\Flagged \HasNoChildren) "/" "[Gmail]/Oznaczone gwiazdk&AQU-"
See: &AQU-... it is polish "ą"
Question is how to get rid of this? I cannot find how to decode this bytecode
The encoding is IMAP4 Modified UTF-7, which is a convention used for international mailbox names, as defined in RFC3501, section 5.1.3.
Unfortunately, the imaplib module doesn't currently support it - although there are several issues on the python bug tracker that suggest that may change in the near future (e.g. issue 5305 and issue 22598).
Anyway, in the meantime, it looks like you will have to find a third-party package to handle this (e.g. imapclient).

Trailing equal signs (=) in emails

I download messages from a Gmail account using POP3 and save them in a SQLite database for futher processing:
mailbox = poplib.POP3_SSL('pop.gmail.com', '995')
mailbox.user(user)
mailbox.pass_(password)
msgnum = mailbox.stat()[0]
for i in range(msgnum):
msg = '\n'.join(mailbox.retr(i+1)[1])
save_message(msg, dbmgr)
mailbox.quit()
However, looking in the database, all lines but the last one of the message body (payload) have trailing equal signs. Do you know why this happens?
Frederic's link lead me to the answer. The encoding is called "quoted printable" (wiki) and it's possible to decode it using the quopri Python module (documentation):
msg.decode('quopri').decode('utf-8')
Update for python 3.x
You now have to invoke the codecs module.
import codecs
bytes_msg = bytes(msg, 'utf-8')
decoded_msg = codecs.decode(bytes_msg, 'quopri').decode('utf-8')

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them together. Then add that to a body of the email through by using webbrowser.
One text file is English characters (ascii) and the other Japanese (UTF-8). The text will display fine if I write it to a text file. But if I use webbrowser to insert the text into an email body the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be. Thunderbird and Mail (MacOSX) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 20: ordinal not in
range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test#test.com"
cc_list = "test1#test.com, test2#test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
f_en = open(email_template_en, "r")
f_jp = codecs.open(email_template_jp, "r", "UTF-8")
try:
email_content_en = f_en.read()
email_content_jp = f_jp.read()
email_en = re.sub(r'{{date}}', date_range, email_content_en)
email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
# this throws an error
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
# email_en_jp = (email_en + email_jp).encode("UTF-8")
email_en_jp = (email_en + email_jp)
finally:
f_en.close()
f_jp.close()
pass
except Exception, e:
raise e
# Open the default mail client and fill in all the information
try:
f = open(email_to_send, "w")
try:
f.write(email_en_jp)
# Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" %(mail_list, subject, cc_list, email_en_jp), new=1) # open mail client with prefilled info
finally:
f.close()
pass
except Exception, e:
raise e
edit: Forgot to add I am using Python 2.7.1
EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname."
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
I leave my previous answer as is below, although it does not work.
I have run into same issues with Python 2.7.x quite a couple of times now and every time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.

How to send a file through Soap in python?

I want to send a zip file through SOAP (from a SOAP client to a SOAP server) in python.
Following the reading of this SO question, I choose to use suds as my python client.
But according to this, suds do not support sending attachment. A method is given to circumvent the problem but I've not been able to make it work. I'm puzzled over what I'm supposed to give as parameters.
Anyone know how to send a file through Soap in python ?
If needed I'll switch to another SOAP client library.
Download the provided wrapper, and then where you would normally say something like...
client.service.fooMethod(fooParam1,fooParam2,...)
...instead do...
soap_attachments.with_soap_attachment(client.service.fooMethod,binaryParam,fooParam1,fooParam2,...)
Where binaryParam is of the type expected by soap_attachements.py. For example, if you wanted to send a png image I think (never done this) you would do:
imageFile = open('imageFile.png','rb')
imageData = imageFile.read()
mimeType = 'image/png'
binaryParam = (imageData, uuid.uuid4(), mimeType)
Attachments are best way to send binary file through SOAP. If you can't use any other method but only SOAP, just encode your binaries with Base64 and paste it into SOAP method as a parameter. It isn't pure, but works great with small attachments. Large binaries? Use FTP, WebDAV and all others native ways for sending files between hosts.
I made the following changes to soap_attachments.py under suds to get my own uploads to work. You may not need some of the changes that I've made to this, but hopefully it'll at least give you a start.
--- /home/craig/Downloads/soap_attachments.py 2011-07-08 20:38:55.708038918 -0400
+++ soap_attachments.py 2011-06-21 10:29:50.090243052 -0400
## -1,4 +1,8 ##
+import uuid
+import re
def with_soap_attachment(suds_method, attachment_data, *args, **kwargs):
+ HUD_ARM_SERVICE_URL = suds_method.client.wsdl.url
+ HUD_ARM_SERVICE_URL = HUD_ARM_SERVICE_URL.replace('wsdl','xsd')
""" Add an attachment to a suds soap request.
attachment_data is assumed to contain a list:
## -16,7 +20,9 ##
soap_method = suds_method.method
if len(attachment_data) == 3:
+ print "here"
data, attachment_id, attachment_mimetype = attachment_data
+ attachment_id = uuid.uuid4()
elif len(attachment_data) == 2:
data, attachment_id = attachment_data
attachment_mimetype = MIME_DEFAULT
## -55,7 +61,7 ##
])
# Build the full request
- request_text = '\n'.join([
+ request_text = '\r\n'.join([
'',
'--%s' % boundary_id,
soap_headers,
I then use:
f = open(dir_path + infile,'rb')
data_file = f.read()
data_file_type = mimetypes.guess_type(infile)[0]
(filename,ext) = infile.split('.')
...
clientargs = [...]
identifier = with_soap_attachment(client.service.fooThing, [data_file, '1', data_file_type], credentials['foo'],credentials['bar'], morefoo)
You might not need all of these changes, but it's what got me going.
Hope this helps!

Categories