Why is Python3 decode/encode duplicating characters?

Why is Python3 decode/encode duplicating characters? - python

I am trying to check the subject of emails in foreign languages for automatic tests. I was having issues with encoding so I decided to try and write something that handles the encoding of the subject. In this case it's given to me in base64. Converting this to utf-8 and then decoding it produces this strange double character problem. Here is some code I used to test this:
import base64
ja_str = "こんにちは"
encoded_js = base64.b64encode(ja_str.encode())
print (encoded_js)
print(base64.b64decode(encoded_js).decode())
The result of the above is:
b'44GT44KT44Gr44Gh44Gv'
ここんんににちちはは

Related

Is there a way to get around unicode issues when using win32api/com modules in python 3?

I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails but some of them don't parse. I checked the broken emails using.
print (msg.Body.encode('utf8'))
and my problem messages all start with b'.
like this
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing python to read the body as bytes but I'm not sure. Either way after the b, no matter what encoding I try I don't get anything but garbage text.
I've tried other encoding methods as well decoding before but I'm just getting a ton of attribute errrors.
import win32api
import win32com.client
import datetime
import os
import time
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y"))
print (dater)
#for folders in outlook.folders:
# print(folders)
Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):
print (msg.CreationTime)
print (msg.Subject)
print (msg.Body.encode('utf8'))
print ('..............................')
End result is to have the message printed out in the console, or at least give Python a way to read it so I can find the text I'm looking for in the body.

The byte literal posted in the question is valid UTF-8. First two characters are U+683C and U+6D74 from the CJK Unified Ideographs block, U+4E00 - U+9FFF.
Since you don't know the source encoding there is no way to be completely sure about it, but chances are that email body is just Han characters encoded in UTF-8 (Determine the encoding of text in Python). If you are not being able to see the UTF-8 characters correctly you should check your terminal or display character set.
That said, you should to get the fundamentals of character representation right. Randomly encoding or decoding is hardly going to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move to Batchelder on Unicode in Python.

As martineau said the proper encoding I was searching for was utf16. The other messages were encoded using utf8. So a simple mail scrape turned out to be an excellent lesson in encoding as well message Classes (off topic). Thanks for the help.

Email quoted printable encoding confusion

I'm constructing MIME encoded emails with Python and I'm getting a difference with the same email that is MIME encoded by Amazon's SES.
I'm encoding using utf-8 and quoted-printable.
For the character "å" (that's the letter "a" with a little circle on top), my encoding produces
=E5
and the other encoding produces
=C3=A5
They both look ok in my gmail, but I find it weird that the encoding is different. Is one of these right and the other wrong in any way?
Below is my Python code in case that helps.
====
cs = charset.Charset('utf-8')
cs.header_encoding = charset.QP
cs.body_encoding = charset.QP
# See https://stackoverflow.com/a/16792713/136598
mt = mime.text.MIMEText(None, subtype)
mt.set_charset(cs)
mt.replace_header("content-transfer-encoding", "quoted-printable")
mt.set_payload(mt._charset.body_encode(payload))

Ok, I was able to figure this out, thanks to Artur's comment.
The utf-8 encoding of the character is two bytes and not one so you should expect to see two quoted printable encodings and not one so the AWS SES encoding is correct (not surprisingly).
I was sending unicode text and not utf-8 which causes only one quoted printable character. It turns out that it worked because gmail supports unicode.
For the Python code in my question, I need to manually encode the text as utf-8. I was thinking that MIMEText would do that for me but it does not.

Python gpgme non-ascii text handling

I am trying to encrypt-decrypt a text via GPG using pygpgme, while it works for western characters decryption fails on a Russian text. I use GPG suite on Mac to decrypt e-mail.
Here's the code I use to produce encrypted e-mail body, note that I tried to encode message in Unicode but it didn't make any difference. I use Python 2.7.
Please help, I must say I am new to Python.
ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')
payload = 'Просто тест'
#plain = BytesIO(payload.encode('utf-8'))
plain = BytesIO(payload)
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)

There are multiple problems here. You really should read the Unicode HOWTO, but I'll try to explain.
payload = 'Просто тест'
Python 2.x source code is, by default, Latin-1. But your source clearly isn't Latin-1, because Latin-1 doesn't even have those characters. What happens if you write Просто тест in one program (like a text editor) as UTF-8, then read it in another program (like Python) as Latin-1? You get ÐÑÐ¾ÑÑÐ¾ ÑÐµÑÑ. So, what you're doing is creating a string full of nonsense. If you're using ISO-8859-5 rather than UTF-8, it'll be different nonsense, but still nonsense
So, first and foremost, you need to find out what encoding you did use in your text editor. It's probably UTF-8, if you're on a Mac, but don't just guess; find out.
Second, you have to tell Python what encoding you used. You do that by using an encoding declaration. For example, if your text editor uses UTF-8, add this line to the top of your code:
# coding=utf-8
One you fix that, payload will be a byte string, encoded in whatever encoding your text editor uses. But you can't encode already-encoded byte strings, only Unicode strings.
Python 2.x will let you call encode on them anyway, but it's not very useful—what it will do is first decode the string to Unicode using sys.getdefaultencoding, so it can then encode that. That's unlikely to be what you want.
The right way to fix this is to make payload a Unicode string in the first place, by using a Unicode literal. Like this:
payload = u'Просто тест'
Now, finally, you can actually encode the payload to UTF-8, which you did perfectly correctly in your first attempt:
plain = BytesIO(payload.encode('utf-8'))
Finally, you're encrypting UTF-8 plain text with GPG. When you decrypt it on the other side, make sure to decode it as UTF-8 there as well, or again you'll probably see nonsense.

encoding problems writing UTF-8 SQL statements to a local file

I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = 'å¤§å¦ã®ããã·ã§ã¯ãã¦å¦çãæ¬å¤§éç½ã®è¢«ç½å°(å²©æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?

We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)

Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
You should use encode on objects of class unicode, and decode on objects of class str in python.
You should escape any string you insert into SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.

how to url-safe encode a string with python? and urllib.quote is wrong

Hello i was wondering if you know any other way to encode a string to a url-safe, because urllib.quote is doing it wrong, the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But thats not the correct output, it should be
%E1
As demostrated by the tool provided here this site
And this is not me being difficult, the incorrect output of quote is preventing the browser to found resources, if i try
urllib.quote('\images\á\some file.jpg')
And then i try with the javascript tool i mentioned i get this strings respectively
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how is almost the same but the url provided by quote doesn't work and the other one it does.
I tried messing with encode('utf-8) on the string provided to quote but it does not make a difference.
I tried with other spanish words with accents and the ñ they all are differently represented.
Is this a python bug?
Do you know some module that get this right?

According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.

Ok, got it, i have to encode to iso-8859-1 like this
word = u'á'
word = word.encode('iso-8859-1')
print word

Python is interpreted in ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpereted as two ASCII chars.
Try putting a comment as the first of second line of your code like this to match the file encoding, and you might need to use u'á' also.
# coding: utf-8

What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1

In this question it seems some guy wrote a pretty large function to convert to ascii urls, thats what i need. But i was hoping there was some encoding tool in the std lib for the job.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.