Unicode and locale issues

I am struggling to write a Python 2.7 script that uses Unicode strings. The problem arises when I use the built-in locale module. Here is the snippet I am having trouble with:
# -*- coding: utf-8 -*-
import datetime
import os
import locale
locale.setlocale(locale.LC_ALL, 'greek')
day = datetime.date.today()
dayFull = day.strftime('%A')
myString = u"ΚΑΛΗΜΕΡΑ"
print myString
print dayFull
While dayFull prints the current day name just fine (in Greek letters), myString comes out in the console as question marks. How can I fix this? Can someone please point out my mistake?
P.S. My system is a Windows 7 machine.

Use the correct Greek code page in the console, as well as a font that supports Greek characters, such as Consolas. This worked for me in Windows 7 and Python 2.7.3:
C:\>chcp 1253
Active code page: 1253
C:\>python temp.py
ΚΑΛΗΜΕΡΑ
Σάββατο
FYI, Python 3.3 works correctly with the (also Greek) 737 code page, but Python 2.7 prints:
C:\>temp.py
????????
Σάββατο
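To see why the active code page matters, here is a minimal sketch (written for Python 3; the helper name encodable is made up for illustration) that checks whether a string can be represented in a given code page. It only exercises Python's codecs and says nothing about what the console font can draw.

```python
# -*- coding: utf-8 -*-

def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

greeting = u"ΚΑΛΗΜΕΡΑ"
# cp1253 and cp737 are the Greek code pages mentioned above; ascii is a control.
for code_page in ("cp1253", "cp737", "ascii"):
    print(code_page, encodable(greeting, code_page))
```

When Python has to hand encoded bytes to the console, a code page that fails this check cannot carry the text, whichever font is selected.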

Related

VScode's Python Debug Console doesn't print Unicode Chinese Character Properly

Here is the code to illustrate the problem:
# -*- coding:utf-8 -*-
text = u"严"
print text
If I run the code above in the VSCode debugger, it prints "涓" instead of "严". That is what you get when the first 2 bytes (\xe4\xb8) of u"严" in UTF-8 (\xe4\xb8\xa5) are decoded with the gbk codec: \xe4\xb8 in gbk is "涓".
However, if I run the same code in PyCharm, it prints "严" exactly as I expected, and the same happens when I run it in PowerShell.
Weird that the VSCode Python debugger behaves differently from the plain Python interpreter. How can I get the correct output? I do not think adding a decode("gbk") to the end of every string would be a good idea.
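The byte-level explanation in the question can be reproduced without VSCode. This is only a sketch of the decoding mismatch, written for Python 3; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u"严"
utf8_bytes = text.encode("utf-8")  # three bytes: e4 b8 a5

# A GBK reader pairs up the first two bytes; the third is left dangling.
misread = utf8_bytes[:2].decode("gbk")

print(repr(utf8_bytes))
print(misread == u"涓")
# Decoding with the codec that produced the bytes round-trips cleanly.
print(utf8_bytes.decode("utf-8") == text)
```

The comparisons are printed as booleans so the sketch itself runs on a console that cannot display Chinese.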
My Environment data
VS Code version: 1.21
VSCode Python Extension version : 2018.2.1
OS and version: Windows 10
Python version : 2.7.14
Type of virtual environment used : No

For Windows users: in your System Variables, add a PYTHONIOENCODING variable, set its value to UTF-8, then restart VSCode. This worked on my PC.
Modify the tasks.json file in VSCode. I am not sure whether this still works on version 2.0.
You can find it here: Changing the encoding for a task output
or here on GitHub:
Tasks should support specifying the output encoding
Add this before the rest of your py script (note that sys.stdout.buffer exists only in Python 3):
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
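What the TextIOWrapper line changes can be checked in isolation. In this sketch (Python 3) an in-memory io.BytesIO stands in for sys.stdout.buffer, which is an assumption made only so that the written bytes can be inspected.

```python
# -*- coding: utf-8 -*-
import io

# Stand-in for sys.stdout.buffer so the bytes can be inspected afterwards.
raw = io.BytesIO()
# newline="" switches off newline translation, so the bytes match on every OS.
out = io.TextIOWrapper(raw, encoding="utf-8", newline="")

out.write(u"严\n")
out.flush()

written = raw.getvalue()
print(repr(written))
```

The wrapper decides which bytes leave the process; whether they display correctly still depends on the program reading them using the same encoding.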

If you open your python file in VS2017 you can do the following:
Go to:
File->
Save selected item as ->
click the down-arrow next to the "Save" button
click "Save With Encoding..."
select the encoding you need
if the .py file is already saved, overwrite it: select "Yes"
select for example : "Chinese Simplified (GB18030) - Codepage 54936"
Also, add the following on line 2 of your .py file:
# -*- coding: gb18030 -*- or # -*- coding: gb2312 -*-
Those encodings accept your 严 character.
Nice link to an encoder/decoder tester here.

What's the deal with Python 3.4, Unicode, different languages and Windows?

Happy examples:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
czech = u'Leoš Janáček'.encode("utf-8")
print(czech)
pl = u'Zdzisław Beksiński'.encode("utf-8")
print(pl)
jp = u'リング 山村 貞子'.encode("utf-8")
print(jp)
chinese = u'五行'.encode("utf-8")
print(chinese)
MIR = u'Машина для Инженерных Расчётов'.encode("utf-8")
print(MIR)
pt = u'Minha Língua Portuguesa: çáà'.encode("utf-8")
print(pt)
Unhappy output:
b'Leo\xc5\xa1 Jan\xc3\xa1\xc4\x8dek'
b'Zdzis\xc5\x82aw Beksi\xc5\x84ski'
b'\xe3\x83\xaa\xe3\x83\xb3\xe3\x82\xb0 \xe5\xb1\xb1\xe6\x9d\x91 \xe8\xb2\x9e\xe5\xad\x90'
b'\xe4\xba\x94\xe8\xa1\x8c'
b'\xd0\x9c\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\x98\xd0\xbd\xd0\xb6\xd0\xb5\xd0\xbd\xd0\xb5\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85 \xd0\xa0\xd0\xb0\xd1\x81\xd1\x87\xd1\x91\xd1\x82\xd0\xbe\xd0\xb2'
b'Minha L\xc3\xadngua Portuguesa: \xc3\xa7\xc3\xa1\xc3\xa0'
And if I print them like this:
jp = u'リング 山村 貞子'
print(jp)
I get:
Traceback (most recent call last):
  File "x.py", line 5, in <module>
    print(jp)
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
I've also tried the following from this question (and other alternatives that involve sys.stdout.encoding):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

jp = u'リング 山村 貞子'
safeprint(jp)
And things get even more cryptic:
πâ¬πâ│πé░ σ▒▒µ¥æ Φ▓₧σ¡É
And the docs were not very helpful.
So, what's the deal with Python 3.4, Unicode, different languages and Windows? Almost all possible examples I could find, deal with Python 2.x.
Is there a general and cross-platform way of printing ANY Unicode character from any language in a decent and non-nasty way in Python 3.4?
EDIT:
I've tried typing at the terminal:
chcp 65001
to change the code page, as proposed here and in the comments, and it did not work (including the attempt with sys.stdout.encoding).
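The cryptic output from safeprint() is consistent with UTF-8 bytes being decoded as cp437, which is an assumption based on the characters shown rather than on the traceback. The sketch below (Python 3) reproduces that for the first word and shows one hedged alternative: escape whatever the console encoding lacks instead of re-decoding. The helper name console_safe is hypothetical.

```python
# -*- coding: utf-8 -*-

def console_safe(text, encoding, errors="backslashreplace"):
    """Return text with characters the encoding lacks turned into escapes."""
    return text.encode(encoding, errors).decode(encoding)

jp = u"リング"

# What safeprint() effectively did: UTF-8 bytes read back as cp437.
garbled = jp.encode("utf-8").decode("cp437")
print(garbled == u"πâ¬πâ│πé░")

# Escaping keeps the output lossless and printable on any console.
escaped = console_safe(jp, "cp437")
print(escaped)
```

This does not make the console show Japanese; it only avoids both the exception and the mojibake.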

Update: Since Python 3.6, the code example that prints Unicode strings directly should just work now (even without py -mrun).
Python can print text in multiple languages in Windows console whatever chcp says:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py
where your_script.py prints Unicode directly e.g.:
#!/usr/bin/env python3
print('š áč') # cz
print('ł ń') # pl
print('リング') # jp
print('五行') # cn
print('ш я жх ё') # ru
print('í çáà') # pt
All you need is to configure the font in your Windows console that can display the desired characters.
You could also run your Python script via IDLE without installing non-stdlib modules:
T:\> py -midlelib -r your_script.py
To write to a file/pipe, use PYTHONIOENCODING=utf-8 as @Mark Tolonen suggested:
T:\> set PYTHONIOENCODING=utf-8
T:\> py your_script.py >output-utf8.txt
Only the last solution supports non-BMP characters such as 😒 (U+1F612 UNAMUSED FACE) -- py -mrun can write them but Windows console displays them as boxes even if the font supports corresponding Unicode characters (though you can copy-paste the boxes into another program, to get the characters).
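The effect of PYTHONIOENCODING on piped output can be observed from Python itself. This sketch (Python 3) starts a child interpreter with the variable set and captures its raw bytes; the child one-liner is made up for the demonstration.

```python
import os
import subprocess
import sys

# The child prints U+2192; the escape keeps the command line ASCII-only.
child_code = "print(u'\\u2192')"

# Copy the current environment and force UTF-8 for the child's stdout.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

captured = subprocess.check_output([sys.executable, "-c", child_code], env=env)
print(repr(captured.strip()))
```

strip() removes the trailing newline, which differs between platforms.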

The problem was (see the Python 3.6 update below) with the Windows console, which supports an ANSI character set appropriate for the region targeted by your version of Windows. By default, Python throws an exception when unsupported characters are output.
Python can read an environment variable to output in other encodings, or to change the default error handling. Below, I read the console's default code page, then change the error handling to print a ? instead of throwing an error for characters that the console's current code page does not support.
C:\>chcp
Active code page: 437 # Note, US Windows OEM code page.
C:\>set PYTHONIOENCODING=437:replace
C:\>example.py
Leo? Janá?ek
Zdzis?aw Beksi?ski
??? ?? ??
??
?????? ??? ?????????? ????????
Minha Língua Portuguesa: çáà
Note the US OEM code page is limited to ASCII and some Western European characters.
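The same substitution can be reproduced in plain Python, without touching the environment. This is a sketch (Python 3) of what the replace error handler does for code page 437; the function name as_cp437_console is hypothetical.

```python
# -*- coding: utf-8 -*-

def as_cp437_console(text):
    """Mimic PYTHONIOENCODING=437:replace: unsupported characters become '?'."""
    return text.encode("cp437", "replace").decode("cp437")

samples = [u"Leoš Janáček", u"五行", u"Minha Língua Portuguesa: çáà"]
for sample in samples:
    # True means the string survived the trip through cp437 unchanged.
    print(as_cp437_console(sample) == sample)
```

Only the Portuguese sample survives unchanged, which matches the console output above.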
Below I've instructed Python to use UTF8, but since the Windows console doesn't support it, I redirect the output to a file and display it in Notepad:
C:\>set PYTHONIOENCODING=utf8
C:\>example >out.txt
C:\>notepad out.txt
On Windows, it's best to use a Python IDE that supports UTF-8 instead of the console when working with multiple languages. If you only use one language, select it as the system locale in the Region and Language control panel and the console will support the characters of that language.
Update for Python 3.6
Python 3.6 now uses Windows Unicode APIs to write directly to the console, so the only limit is the console font's support for the characters. The following code works in a US Windows console. Because I have a Chinese language pack installed, it even displays the Chinese and Japanese text if the console font is changed. Even without the correct font, replacement characters are shown in the console, and cut-and-paste into an environment such as this web page displays the characters correctly.
#!python3.6
#coding: utf8
czech = 'Leoš Janáček'
print(czech)
pl = 'Zdzisław Beksiński'
print(pl)
jp = 'リング 山村 貞子'
print(jp)
chinese = '五行'
print(chinese)
MIR = 'Машина для Инженерных Расчётов'
print(MIR)
pt = 'Minha Língua Portuguesa: çáà'
print(pt)
Output:
Leoš Janáček
Zdzisław Beksiński
リング 山村 貞子
五行
Машина для Инженерных Расчётов
Minha Língua Portuguesa: çáà

How to write utf8 to standard output in a way that works with python2 and python3

I want to write a non-ascii character, let's say →, to standard output. The tricky part seems to be that some of the data that I want to concatenate to that string is read from json. Consider the following simple json document:
{"foo":"bar"}
I include this because if I just want to print → then it seems enough to simply write:
print("→")
and it will do the right thing in python2 and python3.
So I want to print the value of foo together with my non-ascii character →. The only way I found to do this so that it works in both python2 and python3 is:
getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))
or
getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))
It is important to not miss the u in front of → because otherwise a UnicodeDecodeError will be thrown by python2.
Using the print function like this:
print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))
doesn't seem to work, because python3 complains: TypeError: 'str' does not support the buffer interface.
Did I find the best way or is there a better option? Can I make the print function work?

The most concise I could come up with is the following, which you may be able to make more concise with a few convenience functions (or even replacing/overriding the print function):
# -*- coding=utf-8 -*-
import codecs
import os
import sys

# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')

if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(bytes(output.encode('utf-8')))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)
The best option is using the -*- encoding line, which allows you to use the actual character in the file. But if for some reason, you can't use the encoding line, it's still possible to accomplish without it.
This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1.
It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)
LANG=zh_CN python test.py
It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
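One possible shape for the convenience function mentioned above is sketched here. The name write_utf8 is hypothetical, and the ASCII-only stream is just a stand-in for a console whose encoding is not UTF-8; the sketch was written against Python 3 and relies on the same getattr(..., 'buffer', ...) trick as the question.

```python
# -*- coding: utf-8 -*-
import io
import json
import sys

def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes, bypassing the stream's own encoding."""
    stream = sys.stdout if stream is None else stream
    stream.flush()  # keep ordering with anything already written as text
    # Python 3 text streams expose .buffer; Python 2 file objects take bytes directly.
    target = getattr(stream, "buffer", stream)
    target.write(text.encode("utf-8"))
    target.flush()

data = json.loads(u'{"foo": "bar"}')

# An ASCII-only text stream stands in for a misconfigured console.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
write_utf8(data["foo"] + u"→\n", fake_stdout)

result = fake_stdout.buffer.getvalue()
print(repr(result))
```

Calling write_utf8(data["foo"] + u"→\n") with no stream argument writes to the real sys.stdout in the same way.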

If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:
# -*- coding=utf-8 -*-
print("bar" + u"→")

Problems with Encoding in Eclipse Console and Python

I guess I need some help regarding encodings in Python (2.6) and Eclipse. I used Google and the SO search and tried a lot of things, but I still don't get it.
So, how do I get the Eclipse console to show äöü etc.?
I tried:
Declaring the document encoding in the first line with
# -*- coding: utf-8 -*-
I changed the encoding settings in Window/Preferences/General/Workspace and Project/Properties to UTF-8
As nothing changed I tried the following things alone and in combination but nothing seemed to work out:
Changing the stdout as mentioned in the Python Cookbook:
sys.stdout = codecs.lookup("utf-8")[-1](sys.stdout)
Adding a unicode u:
print u"äöü".encode('UTF8')
reloading sys (I don't know what for but it doesn't work either ;-))
I am trying to do this in order to debug the encoding-problems I have in my programs... (argh)
Any ideas? Thanks in advance!
EDIT:
I work on Windows 7 and it is EasyEclipse

Got it! If you have the same problem, go to
Run/Run Configurations/Common and select UTF-8 (for example) as the console encoding.
So, finally, print "ö" results in "ö"
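When debugging this kind of mismatch, it can help to print what Python itself believes about the encodings in play, both before and after changing the console setting. The helper below is a small diagnostic sketch (Python 3); the function name encoding_report is made up.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python reports for the current process."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }

for name, value in sorted(encoding_report().items()):
    print("%-10s %s" % (name, value))
```

If the stdout entry differs from the console encoding selected in the run configuration, that mismatch is a likely source of garbled output.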

Even though this is a fairly old question, I'm new to Stack Overflow and would like to contribute a bit. You can change the default encoding in Eclipse (currently Neon) for all text editors from the menu Window -> Preferences -> General -> Workspace: Text file encoding

How to print japanese utf-8 on console in windows?

#coding=<utf8>
import os
os.popen('chcp 65001')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()
PYTHON 2.6 on Windows 7
It will run in IDLE with no errors.
However, when it is run from the console, it fails and the window closes so quickly that I can't read the error message.
How can this be done in the Windows console?
By the way, doing this with languages like Spanish or Portuguese works fine. It's languages like Japanese, Russian, Greek and Hebrew that fail this way in the Windows console.
*EDIT
as requested I changed to this code:
#coding=<utf8>
import os, sys
os.popen('chcp 65001')
print(sys.stdout.encoding)
x = raw_input('press enter to continue')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()
It will print:
cp437
and then, of course, go on to flash and fail on the decoding step...
It looks like os.popen('chcp 65001') does not change the code page that Python reports.
I still don't think this is the root of the problem; however, it would be helpful to know an efficient way of changing the code page.
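To check what the console itself is using, independently of sys.stdout.encoding (which Python settles on at startup), the Win32 function GetConsoleOutputCP can be called through ctypes. This is only a sketch, written for Python 3; the wrapper name is hypothetical, and on anything other than Windows it simply returns None.

```python
import sys

def console_output_code_page():
    """Return the console's active output code page, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; windll exists only on Windows
    return ctypes.windll.kernel32.GetConsoleOutputCP()

print(console_output_code_page())
```

Comparing this number with sys.stdout.encoding shows whether a chcp call reached the console without being picked up by the already running interpreter.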

Update
Never mind. The OP is using Windows.
Interestingly changing the encoding declaration to #encoding=<utf8> did not work in Ubuntu.
Original Answer
This worked for me (Ubuntu Jaunty, Python 2.6.2). The only change I made was to the first line declaring the encoding.
# encoding: utf-8
import os
os.popen('chcp 65001')
a = 'こんにちは世界'
print a.decode('utf8')
x = raw_input()
