Unicode encoding in email - python

I have manually created and sent myself an HTML email in Gmail. I want to be able to reuse this HTML output to send it programmatically (using smtplib in Python).
In gmail, I view the source, which appears like:
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="--==_mimepart_57daadsdas2e101427152ee"; charset=UTF-8
----==_mimepart_57daadsdas2e101427152ee
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Hi all !
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Venez d=C3=A9couvrir
My problem is that when I then try to send this content as HTML programmatically, it's not displayed correctly. I suspect it's because of Unicode conversion. For example, I can't convert the characters "d=C3=A9couvrir" back to what they should be: "découvrir".
Could anyone help?

There are some MIME examples that are probably more suitable, but the simple answer from the headers is that the content is UTF-8 with quoted-printable encoding, so you can use the quopri module:
>>> import quopri
>>> quopri.decodestring('Venez d=C3=A9couvrir').decode('utf8')
'Venez découvrir'
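When the whole message source is available rather than a single line, the standard-library email parser can apply the same decoding based on the headers. The following is a minimal sketch, assuming Python 3 and a simplified single-part message built from the headers quoted in the question (the real message is multipart, so each part would need to be visited in turn):

```python
from email import message_from_string, policy

# Simplified single-part message; headers copied from the question,
# the body is the quoted-printable line the asker could not decode.
raw = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Venez d=C3=A9couvrir\n"
)

# policy.default makes the parser return an EmailMessage, whose
# get_content() undoes the transfer encoding and the charset.
msg = message_from_string(raw, policy=policy.default)
text = msg.get_content()
print(text)
```

This avoids hard-coding the charset and transfer encoding, since both are read from the message headers.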

Related

Unable to Set Content-Description in Mime Header in Python

I am sending multiple file attachments in a mail using the Python MIME library. I am trying to set a value for the 'Content-Description' field using the add_header function, but I am unable to set it.
https://docs.python.org/3/library/email.compat32-message.html#email.message.Message.add_header
Code Snippet
msg.add_header('Content-Type','text/html')
msg.add_header('Content-Disposition', 'attachment', filename="intrusion.html")
msg.add_header('Content-Description', 'This is an Mail Attachment')
Kindly advise how headers can be added.
The Content-Type and Content-Disposition headers are set automatically by EmailMessage.add_attachment. Additional headers, such as Content-Description, can be passed as a list of colon-separated "header-name:content" strings using the headers keyword argument. See the docs for ContentManager.set_content, which is called by add_attachment.
This example code:
from email.message import EmailMessage
# Create the container email message.
msg = EmailMessage()
msg['Subject'] = 'This message has an attachment'
msg['From'] = 'me@example.com'
msg['To'] = 'you@example.com'
# The headers list carries extra per-attachment headers.
msg.add_attachment(
    '<p>hello world</p>',
    subtype='html',
    filename='instrusion.html',
    headers=['Content-Description:This is an attachment'],
)
print(msg.as_string())
Produces this output:
Subject: This message has an attachment
From: me@example.com
To: you@example.com
Content-Type: multipart/mixed; boundary="===============1754799949587534235=="
--===============1754799949587534235==
Content-Type: text/html; charset="utf-8"
Content-Description: This is an attachment
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="instrusion.html"
MIME-Version: 1.0
<p>hello world</p>
--===============1754799949587534235==--
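To check that the header ended up on the attachment part rather than on the container, the part can be inspected directly. A small sketch, assuming Python 3.6+ and the file name from the question:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'This message has an attachment'
msg.add_attachment(
    '<p>hello world</p>',
    subtype='html',
    filename='intrusion.html',
    headers=['Content-Description: This is an attachment'],
)

# The container is multipart/mixed; the attachment is its only sub-part.
part = next(msg.iter_attachments())
description = part['Content-Description']
print(description)
print(part.get_filename())
```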

Handling POSTed multipart/form-data file

I'm wondering what is the best way to handle POSTed raw data on the server side.
I'm using the Falcon framework and I'm able to receive the user-submitted file:
-----------------------------1209846671886287098156775745
Content-Disposition: form-data; name="qquuid"
d3ad452e-a287-4cb7-ac1f-f0a5cdb54386
-----------------------------1209846671886287098156775745
Content-Disposition: form-data; name="qqfilename"
Screenshot.png
-----------------------------1209846671886287098156775745
Content-Disposition: form-data; name="qqtotalfilesize"
1951677
-----------------------------1209846671886287098156775745
Content-Disposition: form-data; name="qqfile"; filename="Screenshot.png"
Content-Type: image/png
�PNG
.................lots of bytes............
Using Python, and hopefully some other library, I would like to turn it into some sort of file object from which I can extract the metadata (filename, uuid, etc.) as well as the file itself.
Which library should I use?
Here is a middleware project that looks promising. I'm currently trying to implement this myself in a Falcon service.
falcon-multipart
I have had pretty good luck as well using cgi.FieldStorage(), as found in the following post.
cgi article
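# Note: the cgi module was removed from the standard library in Python 3.13.
import cgi
def on_post(req, resp):
    env = req.env
    # FieldStorage expects QUERY_STRING to be present in the environ.
    env.setdefault('QUERY_STRING', '')
    form = cgi.FieldStorage(fp=req.stream, environ=env)
    # File-like object holding the uploaded content.
    uploaded_file = form['fileinputname'].file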
If you are willing to have one non falcon hook here is an example with bottle:
example
Just a very late followup to this old discussion.
As of Falcon 3.0, the framework supports multipart/form-data natively for both WSGI and ASGI applications.
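Where neither the framework's own support nor cgi is available, one stdlib-only option is to let the email parser split the body. The sketch below is an illustration under stated assumptions, not Falcon's method: `parse_multipart` is a hypothetical helper, the boundary `XYZ` and the sample body are made up, and in a handler the Content-Type value and raw bytes would come from the request. Large or unusual binary uploads are better served by a dedicated multipart parser.

```python
from email import policy
from email.parser import BytesParser


def parse_multipart(content_type, body):
    """Split a multipart/form-data body into {field name: (filename, bytes)}.

    Hypothetical helper: it prepends the request's Content-Type header so
    the email parser can find the boundary in the raw body.
    """
    prefix = ("Content-Type: " + content_type + "\r\n\r\n").encode("ascii")
    msg = BytesParser(policy=policy.HTTP).parsebytes(prefix + body)
    fields = {}
    for part in msg.iter_parts():
        name = part.get_param("name", header="content-disposition")
        # get_filename() is None for plain form fields.
        fields[name] = (part.get_filename(), part.get_payload(decode=True))
    return fields


# Made-up sample mirroring the field names in the question.
sample_body = (
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="qquuid"\r\n'
    b"\r\n"
    b"d3ad452e-a287-4cb7-ac1f-f0a5cdb54386\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="qqfile"; filename="Screenshot.png"\r\n'
    b"Content-Type: image/png\r\n"
    b"\r\n"
    b"PNGDATA\r\n"
    b"--XYZ--\r\n"
)

fields = parse_multipart("multipart/form-data; boundary=XYZ", sample_body)
print(fields)
```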

POST request with Multipart/form-data. Content-type not correct

We're trying to write a Python script (currently using python-requests) to send a POST request to a site where the content has to be multipart/form-data.
When we do this POST request manually (by filling in the form on the site and post), using wireshark, this came up (short version):
Content-Type: multipart/form-data;
Content-Disposition: form-data; name="name"
Data (8 Bytes)
John Doe
When we try to use the python-requests library for achieving the same result, this is sent:
Content-Type: application/x-pandoplugin
Content-Disposition: form-data; name="name"; filename="name"\r\n
Media type: application/x-pandoplugin (12 Bytes)
//and then in this piece is what we posted://
John Doe
The weird thing is that the 'general type' of the packet indeed is multipart/form-data, but the individual item sent (key = 'name', value= 'John Doe') has type application/x-pandoplugin (a random application on my pc I guess).
This is the code used:
response = s.post('http://url.com', files={'name': 'John Doe'})
Is there a way to specify the content-type of the individual items instead of using the headers argument (which only changes the type of the 'whole' packet)?
We think the server doesn't respond correctly due to the fact that it can't understand the content-type we send it.
Little update:
I think the different parts of the multipart content are now identical to the ones sent if I do the POST in the browser, so that's good. Still the server doesn't actually do the changes I send it with the script. The only thing that still is different is the order of the different parts.
For example this is what my browser sends:
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part: (text/plain)
Content-Disposition: form-data; name="file"; filename="ex.txt"\r\n
Content-Type: text/plain\r\n\r\n
Line-based text data: text/plain
lore ipsum blabbla
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="seq"\r\n\r\n
Data (2 bytes)
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="name"\r\n\r\n
Data (2 bytes)
And this is what the script (using python-requests) sends:
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="name"\r\n\r\n
Data (2 bytes)
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part: (text/plain)
Content-Disposition: form-data; name="file"; filename="ex.txt"\r\n
Content-Type: text/plain\r\n\r\n
Line-based text data: text/plain
lore ipsum blabbla
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="seq"\r\n\r\n
Data (2 bytes)
Could it be possible that the server counts on the order of the parts? According to Multipart upload form: Is order guaranteed?, it apparently is? And if so, is it possible to explicitly force an order using the requests library?
And to make things worse in that case: There is a mixture of a file and just text values.
So forcing an order seems rather difficult. This is the current way I do it:
s.post('http://www.url.com', files=files,data = form_values)
EDIT2:
I did a modification in the requests plugin to make sure the order of the parts is the same as in the original request. This doesn't fix the problem so I guess there is no straightforward solution for my problem. I'll send a mail to the devs of the site and hope they can help me!
Your code looks correct.
requests.post('http://url.com', files={'name': 'John Doe'})
... and should send a 'multipart/form-data' POST.
Indeed, I get something like this posted:
Accept-Encoding: gzip, deflate, compress
Connection: close
Accept: */*
Content-Length: 188
Content-Type: multipart/form-data; boundary=032a1ab685934650abbe059cb45d6ff3
User-Agent: python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-27-generic
--032a1ab685934650abbe059cb45d6ff3
Content-Disposition: form-data; name="name"; filename="name"
Content-Type: application/octet-stream
John Doe
--032a1ab685934650abbe059cb45d6ff3--
I have no idea why you'd get that weird Content-Type header:
Content-Type: application/x-pandoplugin
I would begin by removing Pando Web Plugin from your machine completely, and then try your python-requests code again. (or try from a different machine)
As of today you can do:
response = s.post('http://url.com', files={'name': (filename, contents, content_type)})
Python uses a system-wide configuration file to "guess" the mime-type of a file. If those plugins are registering your file extension with their custom mime-type you'll end up putting that in instead.
The safest approach is to write your own MIME type guessing that suits the particular server you're sending to, and only use the native Python MIME type guessing for extensions you didn't think of.
How exactly you specify the content-type manually with python-requests I don't know, but I expect it should be possible.
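The "own guessing first, native guessing as a fallback" approach could look roughly like the sketch below. The `OVERRIDES` table and the `guess_content_type` name are hypothetical; the entries would be whatever the target server expects.

```python
import mimetypes
import os

# Hypothetical table of types the target server is known to expect.
OVERRIDES = {
    ".txt": "text/plain",
    ".png": "image/png",
}


def guess_content_type(filename, default="application/octet-stream"):
    """Prefer the explicit table, then the system-wide guess, then a default."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in OVERRIDES:
        return OVERRIDES[ext]
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(guess_content_type("ex.txt"))
```

The result can then be passed as the third element of the `(filename, contents, content_type)` tuple shown above.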

Can I use curl to test receiving email

I would like an automated way to test how my app handles email, with attachments.
Firstly I modified my app (on App Engine) to log the contents of the request body for a received message (as sent through appspotmail). I copied these contents into a file called test_mail.txt
I figured I could post this file to imitate the inbound mail tester, something like so.
curl --header "Content-Type:message/rfc822" -X POST -d @test_mail.txt http://localhost:8080/_ah/mail/test@example.com
Whenever I do this, the message isn't properly instantiated, and I get an exception when I refer to any of the standard attributes.
Am I missing something in how I am using curl?
I run into the same problem using a simpler email, as posted by _ah/admin/inboundmail
MIME-Version: 1.0
Date: Wed, 25 Apr 2012 15:50:06 +1000
From: test@example.com
To: test@example.com
Subject: Hello
Content-Type: multipart/alternative; boundary=cRtRRiD-6434410
--cRtRRiD-6434410
Content-Type: text/plain; charset=UTF-8
There
--cRtRRiD-6434410
Content-Type: text/html; charset=UTF-8
There
--cRtRRiD-6434410--
Try --data-binary instead of -d as the flag for the input file. When I tried with your flags, it looked like curl stripped the carriage returns out of the input file, which meant the MIME parser choked on the POST data.
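The same test can be driven from Python, which sidesteps the line-ending question because the bytes are sent exactly as given. This is only a sketch: `build_mail_post` is a hypothetical helper, the URL is the one from the question, and the inline message is a shortened stand-in for the logged request body.

```python
import urllib.request


def build_mail_post(raw_message, url):
    """Build (but do not send) a POST that carries the raw message bytes."""
    return urllib.request.Request(
        url,
        data=raw_message,
        headers={"Content-Type": "message/rfc822"},
        method="POST",
    )


# Shortened stand-in for the logged message; CRLF line endings are kept.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"From: test@example.com\r\n"
    b"To: test@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"\r\n"
    b"There\r\n"
)

req = build_mail_post(raw, "http://localhost:8080/_ah/mail/test@example.com")
print(req.get_method(), req.full_url)
```

With the development server running, the request would be sent with `urllib.request.urlopen(req)`. A message stored on disk should be read in binary mode (`open("test_mail.txt", "rb")`) so its line endings are preserved.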
Black,
I noticed your program is written in python? Why not use twisted to create a tiny smtp client?
Here are a few examples..
http://twistedmatrix.com/documents/current/mail/tutorial/smtpclient/smtpclient.html

problem with unicode decoding

This is funny.. I am trying to read geographic lookup data from openstreetmap. The code that performs the query looks like this
params = urllib.urlencode({'q': ",".join([e for e in full_address]), 'format': "json", "addressdetails" : "1"})
query = "http://nominatim.openstreetmap.org/search?%s" % params
print query
time.sleep(5)
response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
print response
The query for Zürich is correctly URL-encoded on UTF-8 data. No wonders here.
http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json
When I print the response, the u with umlaut is encoded latin1 (0xFC)
[{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443
but that's nonsense because openstreetmap returns the JSON data in UTF-8
Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 26 Jan 2011 13:48:33 GMT
Server: Apache/2.2.14 (Ubuntu)
Content-Location: search.php
Vary: negotiate
TCN: choice
X-Powered-By: PHP/5.3.2-1ubuntu4.7
Access-Control-Allow-Origin: *
Content-Length: 3342
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: application/json; charset=UTF-8
Length: 3342 (3.3K) [application/json]
which is also confirmed by the file contents, and then I explicitly say that it's UTF-8 both at read and json parsing.
What's going on here ?
EDIT : apparently it's the json.loads that screws up somehow.
When I go and print the response, the u with umlaut is encoded latin1 (0xFC)
You are just misinterpreting the output. It's a unicode string (you can tell by the u prefix), so there's no encoding "attached". The \xfc means it's the code point with number 0xFC, which happens to be the u-umlaut (see http://www.fileformat.info/info/unicode/char/fc/index.htm). This happens because the numbering of the first 256 Unicode code points coincides with the latin1 encoding.
In short, you did everything right: you have a unicode object with the right content (which is agnostic to encodings). You can choose the encoding you want when you use that content for output somewhere, by doing unicodestr.encode("utf-8") or by using codecs; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
The output is fine. When you print to the console, Python encodes Unicode data only when printing the string itself. If you print a list of unicode strings, each one is shown on the console as its repr():
>>> a=u'á'
>>> a
u'\xe1'
>>> print a
á
>>> [a]
[u'\xe1']
>>> print [a]
[u'\xe1']
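The distinction between a code point and its encoded bytes can be checked directly. The sketch below assumes Python 3 (the thread itself uses Python 2) and a made-up one-field JSON payload standing in for the Nominatim response:

```python
import json

# Made-up stand-in for the server response: UTF-8 bytes on the wire.
raw = b'{"display_name": "Z\xc3\xbcrich"}'
response = json.loads(raw.decode("utf-8"))
name = response["display_name"]

# The decoded string holds code points, not bytes in any encoding.
code_point = ord(name[1])            # U+00FC, the u-umlaut

# An encoding only comes into play when the string is turned into bytes.
utf8_bytes = name.encode("utf-8")    # two bytes for the umlaut
latin1_bytes = name.encode("latin-1")  # one byte, numerically 0xFC

# ascii() shows the escaped form that Python 2's repr() printed.
escaped = ascii(name)
print(hex(code_point), utf8_bytes, latin1_bytes, escaped)
```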
