Unknown Encoding in IMAP Message - python

I am obtaining text/HTML BODY parts of email messages using the IMAP protocol.
For this I use the BODYSTRUCTURE call to obtain the BODY index and charset of a part, then fetch the raw text of that part with the BODY[INDEX] call and try to decode it with Python's decode function.
My problem is that even after decoding some text parts with the given charset (the charset returned by the BODYSTRUCTURE call for that part), they still contain encoded characters.
Only Portuguese/Spanish/other Latin-script text shows this problem, so I assume it involves some kind of Portuguese/Spanish encoding.
How do I detect this and decode the text properly? I would expect that decoding with the given charset leaves no encoded characters, but since that is not what happens, is there a universal way to decode these characters?
I could try a list of common charsets in a try/except loop until one of them decodes the text, but I would honestly prefer a better solution.
Pseudocode is something like this:
# Fetch the BODYSTRUCTURE response (imaplib returns a (result, data) tuple)
result, data = imap_instance.uid('fetch', email_uid, '(BODYSTRUCTURE)')
part_body_index, part_charset = parse_BODY_index_and_charset_from_response(data)

# Fetch the raw text of that part
result, text_part = imap_instance.uid('fetch', email_uid, '(BODY[' + str(part_body_index) + '])')
if part_charset:
    try:
        text_part = text_part.decode(part_charset, 'ignore')
    except LookupError:  # unknown or misspelled charset name
        pass
# "text_part" should now contain no encoded characters...
# but that's not the case.
Examples of encoded text:
A 05/04/2013, =E0s 11:09, XYZ escreveu:>
This text was declared as iso-8859-1; I decoded it and it still looks like this. The symbol =E0 in the string stands for the character "à".
In=EDcio da mensagem reenviada:
This text was declared as windows-1252; I decoded it and it still looks like this. The symbol =ED in the string stands for the character "í".

You need to look at the Content-Transfer-Encoding information (which is also returned in the BODYSTRUCTURE response). You'll need to support both base64 and quoted-printable decoding -- these transform the binary data (such as the UTF-8 or ISO-8859-1 encoding of a given text) into a 7-bit form that is safe for e-mail transfer. Only after you've undone the content transfer encoding should you decode the text from its character encoding (UTF-8, windows-1250, ISO-8859-x, ...) to the Unicode representation you work with.
Both of your examples are encoded using quoted-printable.
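To illustrate the order of operations, here is a minimal sketch using Python's standard quopri and base64 modules, applied to the first example from the question (assuming the part's declared charset is iso-8859-1 and its Content-Transfer-Encoding is quoted-printable):
import base64
import quopri

raw = b'A 05/04/2013, =E0s 11:09, XYZ escreveu:'

# Step 1: undo the Content-Transfer-Encoding (quoted-printable here;
# use base64.b64decode(raw) instead when the CTE is base64).
decoded_bytes = quopri.decodestring(raw)

# Step 2: only now apply the charset from BODYSTRUCTURE.
text = decoded_bytes.decode('iso-8859-1')
print(text)  # A 05/04/2013, às 11:09, XYZ escreveu: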

Related

Finding second encoding of base64 XML string

I have some base64 encoded text fields in some XML data.
To get all the characters to display correctly, I think I need to find an additional encoding that was used on this text, which does not look like UTF-8. And maybe there is some other encoding aspect too; I am not sure.
I am not sure in what order I should be encoding and decoding here. Following https://www.geeksforgeeks.org/encoding-and-decoding-base64-strings-in-python/ I first tried to:
encode the whole string with every possible Python 2.7 encoding, then
decode with base64
(same result each time, no standard representation of the problem characters)
Then I tried to:
encode the string with utf-8
decode with base64
decode the resulting byte string with every possible Python 2.7 encoding
However, none of these result strings seems to give a standard representation of the problem characters, which should display as 'é' and 'ü'.
I enclose an example string below, for which I am sure what the final correct text should be.
Original base64 string: b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
Text string with correct 'é' and 'ü' characters at beginning, deduced from European language knowledge:
'Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
Note the '&#13;' is the HTML encoding of the carriage-return character apparently used in Windows line endings, and '?' might also resolve to another correct character with the correct encoding, or possibly '?' is the actual display in the original data.
It seems to be encoded with mac_roman:
>>> b64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
>>> bs = base64.b64decode(b64)
>>> bs
b'Gr\x9fnder Fr\x8ed\x8eric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
>>> print(bs.decode('mac_roman'))
Gründer Frédéric Jousset&#13;&#13;selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne
The question marks in "Nata?a Petre?in-Bachelez" are present in the original data, presumably the result of a previous encoding/decoding problem.
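A more systematic version of the search described in the question is to decode the base64 payload once and then try candidate codecs, keeping only those whose output contains the characters known to be correct. A small sketch (the candidate list is just an illustrative assumption):
import base64

bs = base64.b64decode(b64_encoded_bytes)
# Keep only the codecs whose output contains the characters we know
# must appear ('é' and 'ü'); only mac_roman should survive here.
for enc in ('utf-8', 'latin-1', 'cp1252', 'mac_roman', 'cp850'):
    try:
        text = bs.decode(enc)
    except UnicodeDecodeError:
        continue
    if u'é' in text and u'ü' in text:
        print(enc)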

Email quoted printable encoding confusion

I'm constructing MIME-encoded emails with Python and I'm seeing a difference from the same email as MIME-encoded by Amazon's SES.
I'm encoding using utf-8 and quoted-printable.
For the character "å" (that's the letter "a" with a little circle on top), my encoding produces
=E5
and the other encoding produces
=C3=A5
They both look ok in my gmail, but I find it weird that the encoding is different. Is one of these right and the other wrong in any way?
Below is my Python code in case that helps.
from email import charset
from email.mime.text import MIMEText

cs = charset.Charset('utf-8')
cs.header_encoding = charset.QP
cs.body_encoding = charset.QP
# See https://stackoverflow.com/a/16792713/136598
# ("subtype" and "payload" are defined elsewhere in my code)
mt = MIMEText(None, subtype)
mt.set_charset(cs)
mt.replace_header("content-transfer-encoding", "quoted-printable")
mt.set_payload(mt._charset.body_encode(payload))
Ok, I was able to figure this out, thanks to Artur's comment.
The UTF-8 encoding of the character is two bytes, not one, so you should expect to see two quoted-printable escape sequences, not one; the AWS SES encoding is correct (not surprisingly).
I was sending Unicode text rather than UTF-8 bytes, which yields only one quoted-printable escape per character. It turns out it worked anyway because Gmail supports Unicode.
For the Python code in my question, I need to manually encode the text as UTF-8. I thought MIMEText would do that for me, but it does not.
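A quick way to see the difference is to quoted-printable-encode the two byte representations directly with the standard quopri module; a small sketch:
import quopri

# UTF-8 encodes U+00E5 ('å') as two bytes, 0xC3 0xA5, so the
# quoted-printable form needs two escape sequences:
print(quopri.encodestring(u'å'.encode('utf-8')))    # =C3=A5

# Latin-1 encodes the same code point as the single byte 0xE5,
# which is where a lone =E5 comes from:
print(quopri.encodestring(u'å'.encode('latin-1')))  # =E5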

Identify garbage unicode string using python

My script reads data from a CSV file; the file can contain multiple strings of English or non-English words.
Sometimes the file has garbage strings; I want to identify those strings, skip them, and process the others.
import codecs
import csv

def is_valid_unicode_str(value):
    try:
        # validation logic goes here
        return True
    except UnicodeEncodeError:
        return False

doc = codecs.open(input_text_file, 'rb', 'utf_8_sig')
fob = csv.DictReader(doc)
for i, row in enumerate(fob):
    if is_valid_unicode_str(row['Name']):
        process_further(row)
csv input:
"Name"
"袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€"
"元大寶來證券"
"John Dove"
I want to define the function is_valid_unicode_str(), which will identify garbage strings so that only valid ones are processed.
I tried to use decode, but it does not fail while decoding garbage strings:
value.decode('utf8')
The expected output is that the strings with Chinese and English text are processed. Could you please guide me on how to implement a function that filters out the invalid strings?
(ftfy developer here)
I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are missing from the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.
Google Translate is not very helpful here but it seems to mean something about e-commerce.
So here's everything that happened to your text:
It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end
If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.
So here's my recommendation:
Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.
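For reference, here is roughly what that looks like on a simpler, single-layer Mojibake (the sample string is my own, not from the question):
# -*- coding: utf-8 -*-
import ftfy

# UTF-8 bytes for 'é' mis-decoded as Latin-1/CP1252 produce 'Ã©';
# fix_text_encoding reverses one layer of that damage.
print(ftfy.fix_text_encoding(u'Ã©tude'))  # étude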
You have Mojibake strings; text encoded to one (correct) codec, then decoded as another.
In this case, your text was decoded with the Windows 1252 codepage; the U+20AC EURO SIGN in the text is typical of CP1252 Mojibakes. The original encoding could be one of the GB* family of Chinese encodings, or a multiple-roundtrip UTF-8 / CP1252 Mojibake. Which one, I cannot determine: I cannot read Chinese, nor do I have your full data. CP1252 Mojibakes include unprintable characters, like the 0x81 and 0x8D bytes, which might have been lost when you posted your question here.
I'd install the ftfy project; it won't fix GB* encodings (I requested the project add support), but it includes a new codec called sloppy-windows-1252 that'll let you reverse an erroneous decode with that codec:
>>> import ftfy # registers extra codecs on import
>>> text = u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€'
>>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
袋�dcx与朋�们���
The � U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be because your copied string here is missing anything unprintable, or the 0x81 and 0x8D bytes.
You can try to fix your data this way; from the file data, try to decode to one of the GB* codecs after encoding to sloppy-windows-1252, or roundtrip from UTF-8 twice and see what fits best.
If that's not good enough (you cannot fix the data) you can use the ftfy.badness.sequence_weirdness() function to try and detect the issue:
>>> from ftfy.badness import sequence_weirdness
>>> sequence_weirdness(text)
9
>>> sequence_weirdness(u'元大寶來證券')
0
>>> sequence_weirdness(u'John Dove')
0
Mojibakes score high on the sequence weirdness scale. You could try to find an appropriate threshold for your data, above which you would consider the data most likely corrupted.
However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but you could not then encode that Chinese text to the CP-1252 codec while you can with the broken text:
from ftfy.badness import sequence_weirdness

def is_valid_unicode_str(text):
    if not sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return False
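Applied to the sample rows from the question, a session would look something like this (import ftfy is needed first so that the sloppy-windows-1252 codec is registered):
>>> import ftfy  # registers the sloppy-windows-1252 codec
>>> is_valid_unicode_str(u'元大寶來證券')
True
>>> is_valid_unicode_str(u'John Dove')
True
>>> is_valid_unicode_str(text)  # the Mojibake sample from above
False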

Python gpgme non-ascii text handling

I am trying to encrypt and decrypt text via GPG using pygpgme; while it works for Western characters, decryption fails on Russian text. I use GPG Suite on a Mac to decrypt the e-mail.
Here's the code I use to produce the encrypted e-mail body; note that I tried to encode the message as Unicode but it didn't make any difference. I use Python 2.7.
Please help; I must say I am new to Python.
import gpgme
from io import BytesIO

ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')
payload = 'Просто тест'
#plain = BytesIO(payload.encode('utf-8'))
plain = BytesIO(payload)
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)
There are multiple problems here. You really should read the Unicode HOWTO, but I'll try to explain.
payload = 'Просто тест'
Python 2.x source code is, by default, Latin-1. But your source clearly isn't Latin-1, because Latin-1 doesn't even have those characters. What happens if you write Просто тест in one program (like a text editor) as UTF-8, then read it in another program (like Python) as Latin-1? You get ÐÑоÑÑо ÑеÑÑ. So what you're doing is creating a string full of nonsense. If you're using ISO-8859-5 rather than UTF-8, it'll be different nonsense, but still nonsense.
So, first and foremost, you need to find out what encoding you did use in your text editor. It's probably UTF-8, if you're on a Mac, but don't just guess; find out.
Second, you have to tell Python what encoding you used. You do that by using an encoding declaration. For example, if your text editor uses UTF-8, add this line to the top of your code:
# coding=utf-8
Once you fix that, payload will be a byte string, encoded in whatever encoding your text editor uses. But you can't encode already-encoded byte strings, only Unicode strings.
Python 2.x will let you call encode on them anyway, but it's not very useful: what it does is first decode the string to Unicode using sys.getdefaultencoding(), so that it can then encode it. That's unlikely to be what you want.
The right way to fix this is to make payload a Unicode string in the first place, by using a Unicode literal. Like this:
payload = u'Просто тест'
Now, finally, you can actually encode the payload to UTF-8, which you did perfectly correctly in your first attempt:
plain = BytesIO(payload.encode('utf-8'))
Finally, you're encrypting UTF-8 plain text with GPG. When you decrypt it on the other side, make sure to decode it as UTF-8 there as well, or again you'll probably see nonsense.
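Putting those pieces together, the original snippet with the fixes applied would look something like this (key ID and trust flag kept from the question):
# coding=utf-8
import gpgme
from io import BytesIO

ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')

payload = u'Просто тест'                  # a Unicode literal
plain = BytesIO(payload.encode('utf-8'))  # encode to UTF-8 bytes
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)
# cipher.getvalue() now holds the ASCII-armored ciphertext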

Python string encoding issue

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.
It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. The only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (pp. 36-37) describes the response element Report as being of type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you receive as your report body from Amazon in a file, with zero transformations. Be aware that the code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "ã" of "São" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.
I think you have to decode it using the correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However, you need to know which encoding to use; input can come in encodings other than utf-8.
The UnicodeDecodeError you received means that your object is not unicode, it is a bytestring. When you call bytestring.encode, the string is first decoded into a unicode object with the default encoding (ascii) and only then encoded with utf-8.
I'll try to explain the difference between a unicode string and a utf-8 bytestring in Python.
unicode is the Python datatype which represents a Unicode string. You use unicode for most string operations in your program. (Python probably uses utf-8 in its internals, though it could also be utf-16; this doesn't matter to you.)
bytestring is a binary-safe string. It can be in any encoding. When you receive data, for example when you open a file, you get a bytestring, and in most cases you will want to decode it to unicode. When you write to a file, you have to encode unicode objects into bytestrings. Sometimes the decoding/encoding is done for you by a framework or library, but a framework cannot always do this, because it cannot always know which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However, you can't decode just any bytestring with utf-8 into unicode: you need to know which encoding was used to produce the bytestring in order to decode it.
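A short Python 2 session illustrating the trap described above: calling encode on a byte string first triggers an implicit ASCII decode, which is exactly what produces the UnicodeDecodeError from the question.
>>> s = 'S\xc3\xa3o Paulo'   # a byte string (UTF-8 encoded)
>>> s.encode('utf-8')        # implicitly tries s.decode('ascii') first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> s.decode('utf-8')        # what was actually wanted
u'S\xe3o Paulo'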
Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)
