Email quoted-printable encoding confusion - Python

I'm constructing MIME-encoded emails with Python, and I'm seeing a difference from the same email as MIME-encoded by Amazon's SES.
I'm encoding using utf-8 and quoted-printable.
For the character "å" (that's the letter "a" with a little circle on top), my encoding produces
=E5
and the other encoding produces
=C3=A5
They both look ok in my gmail, but I find it weird that the encoding is different. Is one of these right and the other wrong in any way?
Below is my Python code in case that helps.
====
from email import charset
from email.mime.text import MIMEText

cs = charset.Charset('utf-8')
cs.header_encoding = charset.QP
cs.body_encoding = charset.QP
# See https://stackoverflow.com/a/16792713/136598
mt = MIMEText(None, subtype)
mt.set_charset(cs)
mt.replace_header("content-transfer-encoding", "quoted-printable")
mt.set_payload(mt._charset.body_encode(payload))

Ok, I was able to figure this out, thanks to Artur's comment.
The utf-8 encoding of the character is two bytes, not one, so you should expect to see two quoted-printable escapes rather than one; the AWS SES encoding is correct (not surprisingly).
I was quoted-printable encoding the Unicode text rather than its utf-8 bytes, which produces only a single escape per character. It turns out that it worked anyway because Gmail supports that form.
For the Python code in my question, I need to manually encode the text as utf-8 before passing it to body_encode; I was thinking that MIMEText would do that for me, but it does not.
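The difference can be reproduced directly with the quopri module: quoted-printable over the single Latin-1 byte for "å" gives =E5, while quoted-printable over its two UTF-8 bytes gives =C3=A5 (a minimal Python 3 sketch):

```python
import quopri

# 'å' is one byte in Latin-1 (0xE5) but two bytes in UTF-8 (0xC3 0xA5)
qp_latin1 = quopri.encodestring('å'.encode('latin-1')).decode('ascii').strip()
qp_utf8 = quopri.encodestring('å'.encode('utf-8')).decode('ascii').strip()
print(qp_latin1)  # =E5
print(qp_utf8)    # =C3=A5
```

So a single =E5 means the payload was quoted-printable encoded from Latin-1 (or raw code points), not from UTF-8 bytes.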

Related

Finding second encoding of base64 XML string

I have some base64 encoded text fields in some XML data.
To get all the characters showing correctly, I think I need to find an additional encoding used on this text, which is not UTF-8 by the look of it. And maybe some other encoding aspect too, I'm not sure.
I am not sure what order I should be encoding and decoding here - following https://www.geeksforgeeks.org/encoding-and-decoding-base64-strings-in-python/ I first tried to:
1. encode the whole string with every possible Python 2.7 encoding, then
2. decode with base64
(same result each time, no standard representation of the problem characters)
Then I tried:
1. encode the string with utf8
2. decode with base64
3. decode the resulting byte string with every possible Python 2.7 encoding
However, none of these answer strings seem to get any standard representation of the problem characters, which should display as 'é' and 'ü'.
I enclose this example string, where I am sure what the final correct text should be.
Original base64 string: b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
Text string with correct 'é' and 'ü' characters at beginning, deduced from European language knowledge:
'Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
Note the '&#13;' is the HTML encoding of what is apparently the carriage-return (new line) character used in Windows, and '?' might also resolve to another correct character with the correct encoding, or possibly '?' is the actual display in the original data.
It seems to be encoded with mac_roman:
>>> b64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
>>> bs = base64.b64decode(b64)
>>> bs
b'Gr\x9fnder Fr\x8ed\x8eric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
>>> print(bs.decode('mac_roman'))
Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne
The question marks in "Nata?a Petre?in-Bachelez" are present in the original data, presumably the result of a previous encoding/decoding problem.
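For completeness, the full recovery can be sketched in a few lines (Python 3 shown; under Python 2.7 the HTML entities would need HTMLParser().unescape instead): decode the base64, then mac_roman, then unescape the '&#13;' entities into newlines.

```python
import base64
from html import unescape  # Python 3; Python 2.7 would use HTMLParser().unescape

b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
raw = base64.b64decode(b64_encoded_bytes)           # bytes, mac_roman-encoded
text = unescape(raw.decode('mac_roman'))            # '&#13;' -> '\r'
text = text.replace('\r', '\n')                     # normalise Windows CRs
print(text)
```

The '?' characters survive unchanged, since they are literal question marks in the source data.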

Python gpgme non-ascii text handling

I am trying to encrypt and decrypt text via GPG using pygpgme; it works for Western characters, but decryption fails on Russian text. I use GPG Suite on Mac to decrypt the e-mail.
Here's the code I use to produce the encrypted e-mail body; note that I tried to encode the message as Unicode but it didn't make any difference. I use Python 2.7.
Please help; I must say I am new to Python.
ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')
payload = 'Просто тест'
#plain = BytesIO(payload.encode('utf-8'))
plain = BytesIO(payload)
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)
There are multiple problems here. You really should read the Unicode HOWTO, but I'll try to explain.
payload = 'Просто тест'
Python 2.x assumes source code is ASCII (or Latin-1, in older versions) unless you declare otherwise. But your source clearly isn't Latin-1, because Latin-1 doesn't even have those characters. What happens if you write Просто тест in one program (like a text editor) as UTF-8, then read it in another program (like Python) as Latin-1? You get ÐÑоÑÑо ÑеÑÑ. So what you're doing is creating a string full of nonsense. If you're using ISO-8859-5 rather than UTF-8, it'll be different nonsense, but still nonsense.
So, first and foremost, you need to find out what encoding you did use in your text editor. It's probably UTF-8, if you're on a Mac, but don't just guess; find out.
Second, you have to tell Python what encoding you used. You do that by using an encoding declaration. For example, if your text editor uses UTF-8, add this line to the top of your code:
# coding=utf-8
Once you fix that, payload will be a byte string, encoded in whatever encoding your text editor uses. But you can't encode already-encoded byte strings, only Unicode strings.
Python 2.x will let you call encode on them anyway, but it's not very useful: what it actually does is first decode the string to Unicode using sys.getdefaultencoding() so that it can then encode the result. That's unlikely to be what you want.
The right way to fix this is to make payload a Unicode string in the first place, by using a Unicode literal. Like this:
payload = u'Просто тест'
Now, finally, you can actually encode the payload to UTF-8, which you did perfectly correctly in your first attempt:
plain = BytesIO(payload.encode('utf-8'))
Finally, you're encrypting UTF-8 plain text with GPG. When you decrypt it on the other side, make sure to decode it as UTF-8 there as well, or again you'll probably see nonsense.
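Putting the pieces together, a minimal sketch of the fixed encode/decode round trip (valid in both Python 2 with the coding declaration and Python 3; gpgme itself is left out):

```python
# coding=utf-8
payload = u'Просто тест'                # a real Unicode string, thanks to u''
utf8_bytes = payload.encode('utf-8')    # encode: Unicode -> bytes, before encryption
roundtrip = utf8_bytes.decode('utf-8')  # decode: bytes -> Unicode, after decryption
print(roundtrip == payload)             # True
```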

Unknown Encoding in IMAP Message

I am obtaining text/HTML BODY parts of email messages using the IMAP protocol.
For this, what I do is use the BODYSTRUCTURE call to obtain the BODY index and the charset of a part, then use the BODY[INDEX] call, obtain the raw text, and try to decode it using the Python decode function.
Now my problem is, even after decoding some text parts with the given charsets (charset obtained from the BODYSTRUCTURE call together with that part), they are still encoded with some unknown encoding.
Only Portuguese/Spanish/other latin text comes with this problem, and therefore I assume this is some kind of Portuguese/Spanish encoding.
So my question is: how do I detect this occurrence and decode it properly? I assumed that decoding the text with the given charset would leave no encoded characters, but since that is not what happens, is there a universal way to decode these characters?
I assume I could just try a list of common charsets and do a try: except: cycle for all of those to try and decode the given text, but I would honestly prefer a better solution.
Pseudocode is something like this:
# Obtain BODYSTRUCTURE call
data, result = imap_instance.uid('fetch', email_uid, '(BODYSTRUCTURE)')
part_body_index, part_charset = parse_BODY_index_and_charset_from_response(data)
text_part, result = imap_instance.uid('fetch', email_uid, '(BODY[' + str(part_body_index) + '])')
if len(part_charset) > 0:
    try:
        text_part = text_part.decode(part_charset, 'ignore')
    except LookupError:  # unknown charset name
        pass
# Content of "text_part" after this should be text with no encoded characters...
# but that's not the case
Examples of encoded text:
A 05/04/2013, =E0s 11:09, XYZ escreveu:>
This text's charset was iso-8859-1; I decoded it and it still looks like this. The symbol =E0 in the string is the character "à".
In=EDcio da mensagem reenviada:
This text's charset was windows-1252; I decoded it and it still looks like this. The symbol =ED in the string is the character "í".
You need to look at the Content-Transfer-Encoding information (which is actually returned in the BODYSTRUCTURE responses). You'll need to support both base64 and quoted-printable decoding -- this transforms the binary data (like UTF-8 or even ISO-8859-1 encoding of a given text) into a 7bit form which is safe for an e-mail transfer. Only after you've undone the content transfer encoding should you go ahead and decode the text from a character encoding (like UTF-8, or windows-1250, or ISO-8859-x, or...) to its Unicode representation that you work with.
Both of your examples are encoded using quoted-printable.
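Both sample strings can be recovered with the stdlib quopri module once the transfer encoding is undone first (a sketch using the charsets reported above):

```python
import quopri

sample1 = b'A 05/04/2013, =E0s 11:09, XYZ escreveu:'
sample2 = b'In=EDcio da mensagem reenviada:'

# Step 1: undo the Content-Transfer-Encoding (quoted-printable)
# Step 2: decode the character set reported by BODYSTRUCTURE
text1 = quopri.decodestring(sample1).decode('iso-8859-1')
text2 = quopri.decodestring(sample2).decode('windows-1252')
print(text1)  # A 05/04/2013, às 11:09, XYZ escreveu:
print(text2)  # Início da mensagem reenviada:
```

Doing the two steps in the other order cannot work, because the quoted-printable escapes are plain ASCII and survive any charset decode untouched.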

Python string encoding issue

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.
It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. The only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p.36-37) describes the response element Report as being type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "ã" of "São" stored as \xC3\xA3, indicating UTF-8 encoding? Or is it stored as \xE3, indicating ISO 8859-1 encoding?
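That check can also be sketched in code (the byte string here is hypothetical; substitute the bytes you actually captured from the report):

```python
# Hypothetical captured report body - substitute your own capture
body = b'S\xc3\xa3o Paulo'

if b'\xc3\xa3' in body:            # the UTF-8 encoding of 'ã'
    text = body.decode('utf-8')
elif b'\xe3' in body:              # the ISO 8859-1 encoding of 'ã'
    text = body.decode('iso-8859-1')
else:
    text = body.decode('ascii')    # no 'ã' bytes found at all
print(text)  # São Paulo
```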
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.
I think you have to decode it using the correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However, you need to know which encoding to use; input can come in encodings other than utf-8.
The error you received, UnicodeDecodeError, means that your object is not unicode - it is a bytestring. When you call bytestring.encode, the string is first decoded into a unicode object with the default encoding (ascii) and only then encoded with utf-8.
I'll try to explain the difference of unicode string and utf-8 bytestring in python.
unicode is a Python datatype which represents a Unicode string. You use unicode for most string operations in your program. (Internally, Python 2 stores unicode strings as UCS-2 or UCS-4; that detail doesn't matter for you.)
bytestring is a binary-safe string. It can be in any encoding. When you receive data, for example when you open a file, you get a bytestring, and in most cases you will want to decode it to unicode. When you write to a file, you have to encode unicode objects into bytestrings. Sometimes decoding/encoding is done for you by a framework or library, but a framework can't always do this, because it can't always know which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However you can't decode any kind of bytestring with utf-8 into unicode. You need to know what encoding is used in the bytestring to decode it.
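A quick illustration of that last point, assuming a UTF-8 byte string (as the report here appears to be):

```python
data = u'São Paulo'.encode('utf-8')  # the byte string b'S\xc3\xa3o Paulo'
print(data.decode('utf-8'))          # São Paulo   - correct codec
print(data.decode('iso-8859-1'))     # SÃ£o Paulo  - wrong codec gives mojibake
```

The wrong codec raises no error; it just silently produces the wrong characters, which is why you must know the encoding rather than guess.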
Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)

How to url-safe encode a string with Python? urllib.quote seems wrong

Hello, I was wondering if you know any other way to encode a string to be url-safe, because urllib.quote seems to be doing it wrong; the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But thats not the correct output, it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult; the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
And then i try with the javascript tool i mentioned i get this strings respectively
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how it is almost the same, but the url produced by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string provided to quote, but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know some module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
Ok, got it. I have to encode to iso-8859-1, like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
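For what it's worth, in Python 3 urllib.parse.quote takes an encoding argument, so both behaviours are one call away without a separate encode step:

```python
from urllib.parse import quote

print(quote('á'))                         # %C3%A1  (UTF-8, per RFC 3986)
print(quote('á', encoding='iso-8859-1'))  # %E1     (what the legacy tool expects)
```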
Python source is interpreted as ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpreted as two ASCII chars.
Try putting a comment on the first or second line of your code, like this, to match the file encoding; you might also need to use u'á'.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII urls; that's what I need. But I was hoping there was some encoding tool in the std lib for the job.
