I have some Japanese words I wish to convert to utf-8, as shown below:
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1.encode('utf-8') # Doesn't work
print jap_word2.encode('utf-8') # Prints properly
Why is it that one word can be converted properly to utf-8 and printed to show the same characters but not the other?
(I am using python 2.6 on Windows 7 Ultimate)
Lots of things must align to print characters properly:
1. What encoding is the script saved in?
2. Do you have a # coding: xxxx statement in your script, where xxxx matches the encoding the file is saved in?
3. Does your output terminal support your encoding? import sys; print sys.stdout.encoding
   a. If not, can you change the console encoding? (chcp command on Windows)
4. Does the font you are using support the characters?
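The first checks in that list can be sketched as a small diagnostic (Python 3 syntax; the escape sequence for 中山 makes it independent of how the file is saved):

```python
import sys

# What does the terminal claim to accept? (e.g. 'utf-8', 'cp1252')
print(sys.stdout.encoding)

# '中山' written as escapes, so the file encoding cannot garble it.
word = u'\u4e2d\u5c71'
utf8_bytes = word.encode('utf-8')   # the bytes that print word.encode('utf-8') emits
print(utf8_bytes)
```

If sys.stdout.encoding is not a UTF-8 variant, those raw UTF-8 bytes will not display as the intended characters.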
Saving the script in UTF-8, this works in both PythonWin and IDLE.
# coding: utf-8
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1
print jap_word2
Interestingly, I got your results with the .encode('utf-8') added to both prints in IDLE, but it worked correctly in Pythonwin, whose default output window supports UTF-8.
Idle is a strange beast. sys.stdout.encoding on my system produces 'cp1252', which doesn't support Asian characters, but it prints the first word wrong and the second one right when printing in UTF-8.
Because your console is not in UTF-8. Run chcp 65001 before running.
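A sketch of what the non-UTF-8 console actually does (Python 3 syntax; the byte values follow from the UTF-8 and cp1252 codec tables):

```python
word = u'\u4e2d\u5c71'            # '中山'
data = word.encode('utf-8')       # b'\xe4\xb8\xad\xe5\xb1\xb1'

# A cp1252 console interprets each UTF-8 byte as its own character,
# producing mojibake instead of the kanji.
mojibake = data.decode('cp1252')
print(mojibake)
```

Switching the console to code page 65001 (UTF-8) makes it decode those bytes as one multi-byte sequence instead.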
In my script I declared code page 1251, but Python 2.7.13 incorrectly outputs some Cyrillic strings:
Программа 'Game Over' 2.0
('\xd2\xee \xe6\xe5', '\xf1\xe0\xec\xee\xe5', '\xf1\xee\xee\xe1\xf9\xe5\xed\xe8\xe5')
('\xd2\xee\xeb\xfc\xea\xee', '\xf7\xf3\xf2\xfc-\xf7\xf3\xf2\xfc', '\xef\xee\xe1\xee\xeb\xfc\xf8\xe5')
оно...
GAME OVER
Нажмите Enter для выхода...
I read related topics before, but they didn't help me. I tried these variants:
# -*- coding: utf-8 -*-
# -*- coding: cp1251 -*-
Why does it happen and how can I fix it?
At the same time, Python 3.6.0 outputs all Cyrillic characters correctly even without declaring the code page:
Программа 'Game Over' 2.0
То же самое сообщение
Только чуть-чуть побольше
оно...
GAME OVER
Нажмите Enter для выхода...
My code:
# coding: cp1251
# game_over.py
# © Andrey Bushman, 2017
print("Программа 'Game Over' " + "2.0")
print("То же", "самое", "сообщение")
print("Только", "чуть-чуть", "побольше")
#print("Вот", end=" ")
print("оно...")
print("""
GAME OVER
""")
print("\a")
input("\n\nНажмите Enter для выхода...")
For 2.7, you should make the strings unicode strings by using the u prefix. The following works both in IDLE and the console (when the console codepage is set to 1251 with chcp 1251).
# coding: utf_8
# game_over.py
# Andrey Bushman, 2017
from __future__ import print_function
print(u"Программа 'Game Over' 2.0")
print(u"То же самое сообщение")
print(u"Только чуть-чуть побольше")
print(u"оно...")
print("""
GAME OVER
""")
print(u"\n\nНажмите Enter для выхода...", end='')
a = raw_input()
I separated the prompt and input because input(u'xxxx') was not working. raw_input is needed in 2.x to avoid evaluating the input.
print("То же", "самое", "сообщение")
Nothing to do with Cyrillic -- Python 2's print statement doesn't have parentheses.
So here you're printing the tuple ("То же", "самое", "сообщение"), not a string. This does the same thing:
tmp = ("То же", "самое", "сообщение")
print tmp
Either remove the parentheses, or add from __future__ import print_function at the top of your module.
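The __future__ fix can be sketched like this (the import is a no-op on Python 3, but on Python 2 it turns print into a function so the comma-separated form is arguments, not a tuple):

```python
from __future__ import print_function

# With the print function, this prints the words joined by spaces,
# instead of the repr of a tuple of escaped byte strings.
print("То же", "самое", "сообщение")
```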
I spent quite some time figuring out how to use Python 2.7 properly with non-latin1 code pages. The easiest solution I found, by far, is to switch to Python 3. Nothing else comes remotely close to it.
The print statement in Python 2 evaluates the parenthesised, comma-separated expressions as a tuple and prints that tuple's repr. That's why each Cyrillic byte string is shown as escape sequences when you separate the values with commas.
What you can do is the following:
import codecs
text = ("То же", "самое", "сообщение")
for i in text:
    print(codecs.decode(i, 'utf-8'))
Or:
text = ("То же", "самое", "сообщение")
print(' '.join(text))
Make sure you have the following line at the top of your python script if you're using python2.
# -*- coding: utf-8 -*-
Short answer: If you want to print chars other than ascii or those in your default codepage on Windows, use 3.6+. Explanation below.
To properly read a file, the encoding declaration must match the actual encoding of the bytes in the file. If you use a limited (non-utf) encoding and want to print strings to Command Prompt, the limited encoding and the console encoding should also match. Or rather, the subset of unicode that you try to print must be included in the subset that the console will accept.
In this case, if you declare the encoding as cp1251 and save it with IDLE, then IDLE appears to save it with that encoding. By definition, the only chars in the file must be in the cp1251 subset. When you print those characters, the console must accept at least the same subset. You can make Command Prompt accept Russian by running chcp 1251 as a command. (chcp == CHange CodePage.) Warning: this command only affects the current Command Prompt window. Anyway, by matching the encoding declaration and the console codepage, I got your code to run on 2.7, 3.5, and 3.6 in the console (but not in IDLE 2.7). But of course, non-ascii, non-cyrillic chars generated by your code will not print.
In 3.x, Python expects code to be utf_8 by default. For 3.6, Python's interface to Windows' consoles was re-written to put the console in utf_8 mode. So write code in an editor that saves it as utf_8 and, as you noticed, printing to the console in Windows works in 3.6. (In 3.x, printing to IDLE's shell has always worked for the entire Basic Multilingual Plane (BMP) subset of unicode. Not working for higher codepoints is a current limitation of tk and hence tkinter, which IDLE uses.)
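The matching requirement above can be sketched as a round trip (Python 3 syntax):

```python
# Text survives only when the write-codec and the read-codec
# (or console codec) agree.
s = u'Нажмите Enter'
assert s.encode('cp1251').decode('cp1251') == s   # matched: intact

# Mismatched codecs produce mojibake rather than an error here,
# because every UTF-8 byte of this string happens to be defined in cp1251.
mojibake = s.encode('utf-8').decode('cp1251')
assert mojibake != s
print(mojibake)
```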
I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as a result, some garbage code points were displayed instead.
I understand that 'юникод' here is just bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But there is a way to get correct output in terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, and your garbage code points confirmed it: you are trying to display cp1251 characters, but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters in source files that fall outside the ASCII range. And anyway it only matters for unicode literals, because bytes from the original source give exactly the same bytes in byte strings. Some text editors are kind enough to automatically honor this line (the IDLE editor is), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. Short story: # -*- coding: cp1251 -*- is unused in your code and can only mislead a reader, since it is not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, the word юникод is composed of these characters: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
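That behavior can be checked directly (sketch in Python 3 bytes-literal syntax; the codec results are the same in Python 2):

```python
# The same six bytes decode fine as cp1251 but are rejected by utf-8.
txt = b'\xfe\xed\xe8\xea\xee\xe4'          # 'юникод' in cp1251
assert txt.decode('cp1251') == u'\u044e\u043d\u0438\u043a\u043e\u0434'

try:
    txt.decode('utf-8')
except UnicodeDecodeError:
    # 0xfe can never start a valid UTF-8 sequence.
    print('invalid utf-8, as expected')
```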
Alternatively, if you are comfortable with your current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give the same results - but now the declared source encoding is the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), unicode(). If you are using a literal to create a string; you should use Unicode literals if the string contains text. It is the only option on Python 3 where you can't put non-ascii characters inside a bytes literal and it is a good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
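A minimal sketch of that point, with hypothetical input bytes assumed to come from a cp1251 source (Python 3 syntax):

```python
# Bytes arriving from some file or API: the right codec depends on
# where the data came from, not on any coding declaration in your script.
data = b'\xcf\xf0\xe8\xe2\xe5\xf2'   # hypothetical cp1251-encoded input
text = data.decode('cp1251')         # decode with the *source's* encoding
assert text == u'\u041f\u0440\u0438\u0432\u0435\u0442'   # 'Привет'
```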
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally1 detect the terminal encoding and encode the print output correctly.
1Unless your terminal is misconfigured.
I just downloaded the latest Python 2.7.11 64bit from its official website and installed it on my Windows 10. And I found that if a new IDLE file contains Chinese characters, like 你好, then I cannot save the file. If I try to save it several times, the file gets corrupted and disappears.
I also installed the latest python-3.5.1-amd64.exe, and it does not have this issue.
How to solve it?
More:
A example code from wiki page, https://zh.wikipedia.org/wiki/%E9%B8%AD%E5%AD%90%E7%B1%BB%E5%9E%8B
If I paste the code here, StackOverflow always warns me: Body cannot contain "I just dow". Why?
Thanks!
More:
I find this config option, but it does not help at all.
IDLE -> Options -> Configure IDLE -> General -> Default Source Encoding: UTF-8
More:
By adding u before the Chinese string, everything works correctly; it is a great way.
Without the u there, the output sometimes comes out corrupted.
Python 2.x uses ASCII as default encoding, while Python 3.x uses UTF-8. Just use:
my_string.encode("utf-8")
to convert ascii to utf-8 (or change it to any other encoding you need)
You can also try to put this line on the first line of your code:
# -*- coding: utf-8 -*-
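To keep the two directions straight (sketch in Python 3 literals; the codec names are the same in Python 2): bytes to text is decode, text to bytes is encode.

```python
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'   # the UTF-8 bytes for '你好'
text = raw.decode('utf-8')          # bytes -> unicode text
assert text == u'\u4f60\u597d'
assert text.encode('utf-8') == raw  # unicode text -> bytes, round trip
```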
Python 2 uses ASCII as the default encoding for its strings, which cannot store Chinese characters. On the other hand, Python 3 uses Unicode strings by default, which can store Chinese characters.
But that doesn't mean Python 2 cannot use Unicode strings. You just have to decode your byte strings into Unicode. Here's an example of converting your strings to Unicode strings.
>>> plain_text = "Plain text"
>>> plain_text
'Plain text'
>>> utf8_text = unicode(plain_text, "utf-8")
>>> utf8_text
u'Plain text'
The prefix u in the string utf8_text says that it is a Unicode string.
You could also do this.
>>> print u"你好"
你好
You just have to prepend your string with u to signify that it is a Unicode string.
When using Python 2 on Windows:
For file with Unicode characters to be saved in IDLE, a line
# -*- coding: utf-8 -*-
has to be added in its beginning.
And for Unicode characters to show correctly in console output in Windows, if running a script, saved in a file, in IDLE console or in Windows shell, strings have to be prepended with u:
print u"你好"
print u"Привет"
But in interactive mode, I found there is no need for this with Cyrillic.
Even in Python 3.7 I still experience the same issue; UTF-8 still does the trick.
I have read through Print the "approval" sign/check mark (✓) U+2713 in Python, but none of the answers work for me. I am running Python 2.7 on Windows.
print u'\u2713'
produces this error:
exceptions.UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>
This:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
does not work because I am using Windows.
print u'\u2713'.encode('utf8')
Prints out mojibake (the UTF-8 bytes rendered in the console's own code page), not the check mark.
print('\N{check mark}')
Is just silly. This prints \N{check mark} literally.
Read http://www.joelonsoftware.com/articles/Unicode.html and you will understand what is going on.
The bad news is: you won't be able to print that character because it simply is not available in the default text encoding of your Windows terminal. Modify your terminal configuration to use "utf-8" instead of the default "cp-852" or whatever the default for Windows' cmd is these days, and you should be good, but do so only after reading the above article, seriously.
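The character itself is perfectly representable; only the console encoding is the obstacle. A quick sketch:

```python
check = u'\u2713'                                # U+2713 CHECK MARK
assert check == u'✓'
assert check.encode('utf-8') == b'\xe2\x9c\x93'  # what a UTF-8 terminal expects
```

On a console not configured for UTF-8, those three bytes are what get misrendered.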
I have a python (2.5.4) script which I run in cygwin (in a DOS box on Windows XP). I want to include a pound sign (£) in the output. If I do so, I get this error:
SyntaxError: Non-ASCII character '\xa3' in file dbscan.py on line 253, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
OK. So I looked at that PEP, and now tried adding this to the beginning of my script:
# coding=cp437
That stopped the error, but the output shows ú where it should show £.
I've tried ISO-8859-1 as well, with the same result.
Does anyone know which encoding I need?
Or where I could look to find out?
The Unicode for a pound sign is 163 (decimal) or A3 in hex, so the following should work regardless of the encoding of your script, as long as the output encoding is working correctly.
print u"\xA3"
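The escape form can be verified directly (Python 3 syntax; u'\xa3' means the same codepoint in both versions):

```python
pound = u'\xa3'                    # U+00A3 POUND SIGN, written as an escape
assert pound == u'£'
assert ord(pound) == 163           # 0xA3 hex, 163 decimal
assert pound.encode('utf-8') == b'\xc2\xa3'
```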
try the encoding :
# -*- coding: utf-8 -*-
and then to display the '£' sign:
print unichr(163)
There are two encodings involved here:
The encoding of your source code, which must be correct in order for your input file to mean what you think it means
The encoding of the output, which must be correct in order for the symbols emitted to display as expected.
It seems your output encoding is off now. If this is running in a terminal window in Cygwin, it is that terminal's encoding that you need to match.
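The ú symptom is explained by exactly this mismatch: the same byte means different characters in different code pages (Python 3 sketch):

```python
# Byte 0xA3 is '£' in latin-1 (what the script intended), but 'ú' in
# cp437, the classic DOS code page the console was using.
assert b'\xa3'.decode('latin-1') == u'£'
assert b'\xa3'.decode('cp437') == u'ú'
```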
EDIT: I just ran the following Python session in a (native) Windows XP terminal window and thought it was slightly interesting:
>>> ord("£")
156
156 is certainly not the codepoint for the pound sign in the Latin1 encoding you tried. It doesn't seem to be in Windows' Codepage 1252 either, which I would expect my terminal to use ... Weird.
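The mystery above has a likely resolution: 156 (0x9C) is the pound sign in the old DOS code pages, which a Windows terminal window typically uses rather than cp1252. A sketch checking this against the codec tables:

```python
# 0x9C decodes to '£' in both classic DOS code pages.
assert b'\x9c'.decode('cp437') == u'£'
assert b'\x9c'.decode('cp850') == u'£'
# 163 is the *unicode* (and latin-1) codepoint for the same character.
assert ord(u'£') == 163
```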