Cyrillic chars in Python 2.7 - python

In my script I declared the 1251 code page, but Python 2.7.13 displays some Cyrillic strings incorrectly:
Программа 'Game Over' 2.0
('\xd2\xee \xe6\xe5', '\xf1\xe0\xec\xee\xe5', '\xf1\xee\xee\xe1\xf9\xe5\xed\xe8\xe5')
('\xd2\xee\xeb\xfc\xea\xee', '\xf7\xf3\xf2\xfc-\xf7\xf3\xf2\xfc', '\xef\xee\xe1\xee\xeb\xfc\xf8\xe5')
оно...
GAME OVER
Нажмите Enter для выхода...
I read this and this topic before, but they didn't help me. I tried these variants:
# -*- coding: utf-8 -*-
# -*- coding: cp1251 -*-
Why does it happen and how can I fix it?
At the same time, Python 3.6.0 prints all Cyrillic characters correctly, even without the code page declaration:
Программа 'Game Over' 2.0
То же самое сообщение
Только чуть-чуть побольше
оно...
GAME OVER
Нажмите Enter для выхода...
My code:
# coding: cp1251
# game_over.py
# © Andrey Bushman, 2017
print("Программа 'Game Over' " + "2.0")
print("То же", "самое", "сообщение")
print("Только", "чуть-чуть", "побольше")
#print("Вот", end=" ")
print("оно...")
print("""
GAME OVER
""")
print("\a")
input("\n\nНажмите Enter для выхода...")

For 2.7, you should make the strings unicode strings by using the u prefix. The following works both in IDLE and the console (when the console codepage is set to 1251 with chcp 1251).
# coding: utf_8
# game_over.py
# Andrey Bushman, 2017
from __future__ import print_function
print(u"Программа 'Game Over' 2.0")
print(u"То же самое сообщение")
print(u"Только чуть-чуть побольше")
print(u"оно...")
print("""
GAME OVER
""")
print(u"\n\nНажмите Enter для выхода...", end='')
a = raw_input()
I separated the prompt and input because input(u'xxxx') was not working. raw_input is needed in 2.x to avoid evaluating the input.

print("То же", "самое", "сообщение")
Nothing to do with Cyrillic -- Python 2's print statement doesn't have parentheses.
So here you're printing the tuple ("То же", "самое", "сообщение"), not a string. This does the same thing:
tmp = ("То же", "самое", "сообщение")
print tmp
Either remove the parentheses, or add from __future__ import print_function at the top of your module.
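A minimal sketch of the effect described above. It is written for Python 3, where bytes objects stand in for Python 2's byte strings; that substitution and the variable names are assumptions of this example. Printing a tuple shows the repr of each element, which is why the non-ASCII bytes come out as \x escapes:

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: bytes objects play the role of Python 2 byte strings.
words = (u"То же", u"самое", u"сообщение")

# Encode to cp1251 to mimic what a Python 2 script saved as cp1251 holds.
encoded = tuple(word.encode("cp1251") for word in words)

# A tuple is displayed via the repr() of its elements, so the bytes are escaped.
tuple_repr = repr(encoded)
print(tuple_repr)

# Joining the pieces into one string avoids printing a container at all.
joined = u" ".join(words)
```

The escapes in tuple_repr are the same byte values that appear in the question's Python 2.7 output.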

I spent quite some time figuring out how to use Python 2.7 properly with non-latin1 code pages. The easiest solution I found, by far, is to switch to Python 3. Nothing else comes remotely close to it.

In Python 2, print is a statement, so the parentheses around the comma-separated values create a tuple. Printing a tuple shows the repr of each element, and the repr of a byte string escapes every non-ASCII byte as \x.., which is why the Cyrillic text turns into escape sequences when you separate the values with commas.
What you can do is the following:
import codecs
text = ("То же", "самое", "сообщение")
for i in text:
    print(codecs.decode(i, 'utf-8'))
Or:
text = ("То же", "самое", "сообщение")
print(' '.join(text))
Make sure you have the following line at the top of your python script if you're using python2.
# -*- coding: utf-8 -*-

Short answer: If you want to print chars other than ascii or those in your default codepage on Windows, use 3.6+. Explanation below.
To properly read a file, the encoding declaration must match the actual encoding of the bytes in the file. If you use a limited (non-utf) encoding and want to print strings to Command Prompt, the limited encoding and the console encoding should also match. Or rather, the subset of unicode that you try to print must be included in the subset that the console will accept.
In this case, if you declare the encoding as cp1251 and save it with IDLE, then IDLE appears to save it with that encoding. By definition, the only chars in the file must be in the cp1251 subset. When you print those characters, the console must accept at least the same subset. You can make Command Prompt accept Russian by running chcp 1251 as a command. (chcp == CHange CodePage.) Warning: this command only affects the current Command Prompt window. Anyway, by matching the encoding declaration and the console codepage, I got your code to run on 2.7, 3.5, and 3.6 in the console (but not in IDLE 2.7). But of course, non-ascii, non-cyrillic chars generated by your code will not print.
In 3.x, Python expects code to be utf_8 by default. For 3.6, Python's interface to Windows' consoles was re-written to put the console in utf_8 mode. So write code in an editor that saves it as utf_8 and, as you noticed, printing to the console in Windows works in 3.6. (In 3.x, printing to IDLE's shell has always worked for all the Basic Multilingual Plane (BMP subset) of unicode. Not working for higher codepoints is a current limitation of tk and hence tkinter, which IDLE uses.)
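The subset rule above can be checked before printing. The following is only a sketch: can_encode is a hypothetical helper, not part of Python, and the sample strings are chosen for illustration.

```python
# -*- coding: utf-8 -*-

def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

russian = u"Нажмите Enter для выхода..."
chinese = u"你好"

print(can_encode(russian, "cp1251"))  # Cyrillic fits into code page 1251
print(can_encode(chinese, "cp1251"))  # Chinese does not
print(can_encode(chinese, "utf-8"))   # UTF-8 covers all of Unicode
```

On a console switched to chcp 1251, only strings for which can_encode(text, "cp1251") is true can be expected to print.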

Related

Comparison of Non ASCII only works in IDLE

I'm doing a fairly simple code that transforms European Portuguese input into Brazilian Portuguese -- so there are a lot of accented characters such as á,é,À,ç, etc.
Basically, the goal is to find words in the text from a list and replace them with the BR words from a second list.
Here's the code:
#-*- coding: latin-1 -*-
listapt=["gestão","utilizador","telemóvel"]
listabr=["gerenciamento", "usuário", "celular"]
while True:
    #this is all because I need to be able to input multiple lines of text, seems to be working fine
    print ("Insert text")
    lines = []
    while True:
        line = raw_input()
        if line != "FIM":
            lines.append(line)
        else:
            break
    text = '\n'.join(lines)
    for word in listapt:
        if word in text:
            num = listapt.index(word)
            wordbr = listabr[num]
            print(word + " --> " + wordbr) #just to show what changes were made
            text = text.replace(word, wordbr)
    print(text)
I run the code on Windows using IDLE and by double-clicking on the .py file.
The code works fine when using IDLE, but does not match and replace characters when double-clicking the .py file.
Here's why the code works as expected in IDLE but not from CMD or by doubleclicking:
Your code is UTF-8 encoded, not latin-1 encoded
IDLE always works in UTF-8 "input/output" mode.
On Windows, CMD/Doubleclicking will use a non-UTF-8 8bit locale.
When your code compares the input to the hardcoded strings it's doing so at a byte level. On IDLE, it's comparing UTF-8 to hardcoded UTF-8. On CMD, it's comparing non-UTF-8 8bit to hardcoded UTF-8 (If you were on a stock MacOS, it would also work).
The way to fix this is to make sure you're comparing "apples with apples". You could do this by converting everything to the same encoding. E.g. Convert the input read to UTF-8 so it matches the hardcoded strings. The better solution is to convert all [byte] strings to Unicode strings (Strings with no encoding). If you were on Python 3, this would be all automatic.
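To illustrate the "apples with apples" point, here is a small sketch. The choice of cp1252 as the console code page is an assumption; the actual code page depends on the Windows locale.

```python
# -*- coding: utf-8 -*-
word = u"gestão"

source_bytes = word.encode("utf-8")    # bytes as stored in a UTF-8 source file
console_bytes = word.encode("cp1252")  # bytes as delivered by a cp1252 console (assumed)

# Comparing raw bytes fails even though both represent the same word.
bytes_equal = source_bytes == console_bytes

# Decoding each side with its own encoding yields identical Unicode text.
text_equal = source_bytes.decode("utf-8") == console_bytes.decode("cp1252")

print(bytes_equal)
print(text_equal)
```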
On Python 2.x, you need to do three things:
Prefix all sourcecode strings with u to make them Unicode strings:
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
...
if line != u"FIM":
Alternatively, add from __future__ import unicode_literals to avoid changing all your code.
Use the correct coding header for the encoding of your file. I suspect your header should read utf-8. E.g.
#-*- coding: utf-8 -*-
Convert the result of raw_input to Unicode. This must be done with the detected encoding of the standard input:
import sys
line = raw_input().decode(sys.stdin.encoding)
By the way, a better way to model the list of words to replace is to use a dict. The keys are the original words and the values are the replacements. E.g.
words = { u"telemóvel": u"celular"}
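A sketch of how the dict suggestion could look once all strings are Unicode. The names REPLACEMENTS and to_brazilian are made up for this example, and the sample sentence is not from the question.

```python
# -*- coding: utf-8 -*-

# Keys are the European Portuguese words, values the Brazilian replacements.
REPLACEMENTS = {
    u"gestão": u"gerenciamento",
    u"utilizador": u"usuário",
    u"telemóvel": u"celular",
}

def to_brazilian(text):
    """Replace each known European Portuguese word in a Unicode string."""
    for pt_word, br_word in REPLACEMENTS.items():
        text = text.replace(pt_word, br_word)
    return text

converted = to_brazilian(u"O utilizador perdeu o telemóvel")
```

With a dict there is no need for listapt.index(word), and the two word lists cannot drift out of step.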
I don't see that problem over here.
Based on your use of raw_input, it seems like you're using Python 2.x
This may be because I'm copypasting off of stack overflow, and have a different dev environment to you.
Try running your script under the latest Python 3 interpreter, as well as removing the "#-*- coding:" line.
This should either hit UnicodeDecodeError issues a lot sooner in your code, or work fine.
The problem you have here is Python 2.x getting confused at some point while trying to translate between byte sequences (what Python 2.x strings contain, eg binary file contents), and human-meaningful text (unicode, eg for things like user informational display of chinese characters), because it makes incorrect assumptions about how human-readable text was encoded into the byte sequence seen in the Python strings.
It's a detail that Python 3 attempts to address a lot better/less ambiguously.
First try executing the code below, it should resolve the issue:
# -*- coding: latin-1 -*-
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
lines=[]
line = raw_input()
line = line.decode('latin-1')
if line != "FIM":
    lines.append(line)
text = u'\n'.join(lines)
for word in listapt:
    if word in text:
        print("Hello")
        num = listapt.index(word)
        print(num)
        wordbr = listabr[num]
        print(wordbr)

Unicode with cp1251 and utf-8 on windows

I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as a result, some trash code points were printed instead.
I understand that 'юникод' is just bytes, like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But is there a way to get correct output in a terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, which your trash code points confirmed: you are trying to display cp1251 characters, but your editor is configured to use UTF-8.
The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters outside the ASCII range in Python source files. In any case it is only used for unicode literals, because bytes from the original source give, er, exactly the same bytes in byte strings. Some text editors are kind enough to automagically use this line (the IDLE editor is), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. Short story: # -*- coding: cp1251 -*- is unused in your code and can only mislead a reader, since it is not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, the word юникод is made up of these bytes: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you are comfortable with your current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give same results - but now the declared source encoding should be the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
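The byte values and the mojibake discussed above can be reproduced in a few lines. This is a Python 3 compatible sketch with illustrative variable names:

```python
# -*- coding: utf-8 -*-
text = u"юникод"

cp1251_bytes = text.encode("cp1251")  # one byte per letter
utf8_bytes = text.encode("utf-8")     # two bytes per letter

# Reading UTF-8 bytes as cp1251 produces the mojibake from the question.
mojibake = utf8_bytes.decode("cp1251")

# Reading cp1251 bytes as UTF-8 is not valid and raises an error.
try:
    cp1251_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The mojibake is reversible as long as no bytes were lost on the way.
repaired = mojibake.encode("cp1251").decode("utf-8")
```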
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode() or unicode() calls. If you use a literal to create a string, use a Unicode literal if the string contains text. It is the only option on Python 3, where you can't put non-ascii characters inside a bytes literal, and it is good practice (using Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
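One way to act on this is to decode bytes right where they enter the program, using the encoding of their source. The helper to_text and the two sample byte strings below are hypothetical, chosen only to illustrate the idea:

```python
# -*- coding: utf-8 -*-

def to_text(data, source_encoding):
    """Decode bytes that came from outside; pass Unicode text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(source_encoding)
    return data

# The same word arriving from two hypothetical sources with different encodings.
from_utf8_file = b"\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4"
from_cp1251_console = b"\xfe\xed\xe8\xea\xee\xe4"

same_text = to_text(from_utf8_file, "utf-8") == to_text(from_cp1251_console, "cp1251")
```

Neither call depends on the coding line of the script; only the origin of the bytes matters.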
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally detect the terminal encoding and encode the print output correctly (unless your terminal is misconfigured).
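To see which encoding Python detected, inspect sys.stdout.encoding. The sketch below also shows a fallback for characters the terminal cannot represent; encode_for_terminal is a hypothetical helper, and replacing unsupported characters with "?" is a choice made for this example, not what print does by default.

```python
# -*- coding: utf-8 -*-
import sys

def encode_for_terminal(text, encoding=None):
    """Encode text for a terminal, replacing unsupported characters with '?'."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")

print(sys.stdout.encoding)  # whatever Python detected for this terminal

as_cp1251 = encode_for_terminal(u"юникод", "cp1251")
as_ascii = encode_for_terminal(u"юникод", "ascii")
```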

Why I cannot save file with Chinese characters when using Python 2.7.11 IDLE?

I just downloaded the latest Python 2.7.11 64bit from its official website and installed it on my Windows 10. I found that if a new IDLE file contains Chinese characters, like 你好, then I cannot save the file. If I try to save it several times, the new file window crashes and disappears.
I also installed the latest python-3.5.1-amd64.exe, and it does not have this issue.
How to solve it?
More:
An example code from the wiki page, https://zh.wikipedia.org/wiki/%E9%B8%AD%E5%AD%90%E7%B1%BB%E5%9E%8B
If I paste the code here, StackOverflow always warns me: Body cannot contain "I just dow". Why?
Thanks!
More:
I find this config option, but it does not help at all.
IDLE -> Options -> Configure IDLE -> General -> Default Source Encoding: UTF-8
More:
Adding u before the Chinese string makes everything work; it is a great approach. Like below:
Without the u, the output is sometimes corrupted. Like below:
Python 2.x uses ASCII as default encoding, while Python 3.x uses UTF-8. Just use:
my_string.encode("utf-8")
to convert ascii to utf-8 (or change it to any other encoding you need)
You can also try to put this line on the first line of your code:
# -*- coding: utf-8 -*-
Python 2 uses ASCII as its default encoding for its strings which cannot store Chinese characters. On the other hand, Python 3 uses Unicode encoding for its strings by default which can store Chinese characters.
But that doesn't mean Python 2 cannot use Unicode strings. You just have to convert (decode) your byte strings into Unicode strings. Here's an example:
>>> plain_text = "Plain text"
>>> plain_text
'Plain text'
>>> utf8_text = unicode(plain_text, "utf-8")
>>> utf8_text
u'Plain text'
The prefix u in the displayed value of utf8_text says that it is a Unicode string.
You could also do this.
>>> print u"你好"
你好
You just have to prepend your string with u to signify that it is a Unicode string.
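The difference between a Unicode string and its encoded bytes can be seen by comparing lengths. A short sketch that runs on Python 3 (the u prefix is accepted there as well); the variable names are illustrative:

```python
# -*- coding: utf-8 -*-
greeting = u"你好"

utf8_bytes = greeting.encode("utf-8")
char_count = len(greeting)    # number of characters
byte_count = len(utf8_bytes)  # number of bytes after encoding

# ASCII has no code for these characters, so encoding to it fails.
try:
    greeting.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```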
When using Python 2 on Windows:
To save a file containing Unicode characters in IDLE, add the line
# -*- coding: utf-8 -*-
at the beginning of the file.
For Unicode characters to display correctly in console output on Windows when running a script saved in a file, whether in the IDLE console or in the Windows shell, strings have to be prefixed with u:
print u"你好"
print u"Привет"
In interactive mode, however, I found this was not needed for Cyrillic.
Even in Python 3.7 I still experience the same issue, and UTF-8 still does the trick.

How to print symbols like ● to files in Python

I'm trying to write the symbol ● to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()
Instead of the black "●" I get "â—". How can I fix this?
Open the file with the io package and set the encoding to utf8; this works with both Python 2 and Python 3. When writing, write a unicode string.
import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()
Tested on Python 2.7.8 and Python 3.4.2
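To confirm what ends up on disk, the file can be read back in binary mode. This sketch writes to a temporary directory instead of ./myFile.txt, which is an assumption made to keep the example self-contained:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Throwaway location so the example does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "myFile.txt")

with io.open(path, "w", encoding="utf8") as out_file:
    out_file.write(u"\u25cf")  # the bullet character

# Binary mode shows the raw bytes that were written.
with io.open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()

# Text mode with the same encoding gives the character back.
with io.open(path, "r", encoding="utf8") as in_file:
    text = in_file.read()
```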
If you are using Python 2, use codecs.open instead of open and unicode instead of str:
# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()
In Python 3, pass the encoding keyword argument to open:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()
>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "w").write(ec.encode('UTF-8'))
This should do the trick
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()
have a look at this
What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.
This does not mean that your program works for everybody.
For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:
# coding=utf-8 # Put your editor's encoding here
import codecs
import locale
import sys
# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
or locale.getpreferredencoding()
or sys.stdin.encoding or sys.stdout.encoding
# Default:
or "UTF8")
outFile = codecs.open('./myFile.txt', 'w', out_encoding)
Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.
If you know the encoding you want for your output file, you can directly put it in open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file—assuming their computer's encoding can represent it).
Then you must print a string, not bytes:
outFile.write(u"●")
(note the leading u, meaning "unicode string").
For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.
I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply non-sense.
It may not be evident at first sight, but text files are indeed encoded and may be encoded in different ways. If you have only letters (upper and lower case, but not accented ones), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because 7-bit ASCII is now a standard and in fact those characters have the same representation in the major encodings.
But as soon as you get true symbols, or accented chars, their representation vary from one encoding to the other. For example, the symbol ● has a UTF-8 representation of (Python coding) : \xe2\x97\x8f. What is worse, it cannot be represented in latin1 (ISO-8859-1) encoding.
Another example is the French e with acute accent, é: it is represented in UTF-8 as \xc3\xa9 (note: 2 bytes), but in Latin1 as \xe9 (one single byte).
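These representations are easy to check from Python itself. A small sketch with illustrative variable names:

```python
# -*- coding: utf-8 -*-
e_acute = u"\u00e9"  # é
bullet = u"\u25cf"   # ●

e_utf8 = e_acute.encode("utf-8")      # two bytes
e_latin1 = e_acute.encode("latin-1")  # one byte
bullet_utf8 = bullet.encode("utf-8")  # three bytes

# Latin-1 has no code for the bullet, so encoding it fails.
try:
    bullet.encode("latin-1")
    bullet_in_latin1 = True
except UnicodeEncodeError:
    bullet_in_latin1 = False
```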
So I tested your code in my Ubuntu box using a UTF8 encoding and the command
cat myFile.txt ... correctly showed the bullet !
sba@sba-ubuntu:~/stackoverflow$ cat myFile.txt
●sba@sba-ubuntu:~/stackoverflow$
(as you didn't add any newline after the bullet, the prompt immediately follows it)
In conclusion :
Your code correctly writes the bullet to the file in UTF8 encoding. If your system natively uses another encoding (ISO-8859-1 or its variant Windows-1252), you cannot natively convert it, because this character simply does not exist in these encodings.
But you can always see it in a text editor that supports different encoding like the excellent vim that exists on all major systems.
Proof of above :
On a Windows 7 computer, I opened a vim window and instructed it to accept utf8 with :set encoding='utf8'. I then pasted original code from OP and saved it to a file foo.py.
I opened a cmd.exe window and executed python foo.py (using a Python 2.7) : it created a file myFile.txt containing the 3 bytes (hexa) : e2 97 8f that is the utf8 representation of the bullet ● (I could confirm it with vim Tools/Hexa convert).
I could even open myFile.txt in idle and actually saw the bullet. Even notepad.exe could show the bullet !
So even on a Windows 7 computer that does not natively accept utf-8, the code from OP correctly generates a text file that when opened with a text editor accepting UTF-8 contains the bullet ●.
Of course, if I try to open myFile.txt with vim in latin1 mode, I get : â—, on a cmd windows with codepage 850, type myFile.txt shows ÔùÅ, and with codepage 1252 (variant of latin1) : â—.
In conclusion original OP code creates a correct utf8 encoded file - it is up to the reading part to interpret correctly utf8.
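The reading side can be simulated by decoding the same three bytes with different code pages. A sketch, assuming Python's built-in cp850 and cp1252 codecs behave like the Windows tools mentioned above; byte 0x8f has no mapping in Python's cp1252 codec, so it is replaced here rather than raising an error:

```python
# -*- coding: utf-8 -*-
bullet_bytes = u"\u25cf".encode("utf-8")  # e2 97 8f

as_utf8 = bullet_bytes.decode("utf-8")    # the intended character
as_cp850 = bullet_bytes.decode("cp850")   # what a cp850 console would show
# Unmapped bytes become U+FFFD instead of raising UnicodeDecodeError.
as_cp1252 = bullet_bytes.decode("cp1252", "replace")
```

The bytes on disk are the same in every case; only the decoder chosen by the reader differs.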

python utf-8 japanese

I have some Japanese words I wish to convert to utf-8, as shown below:
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1.encode('utf-8') # Doesn't work
print jap_word2.encode('utf-8') # Prints properly
Why is it that one word can be converted properly to utf-8 and printed to show the same characters but not the other?
(I am using python 2.6 on Windows 7 Ultimate)
Lots of things must align to print characters properly:
What encoding is the script saved in?
Do you have a # coding: xxxx statement in your script, where xxxx matches the encoding the file is saved in?
Does your output terminal support your encoding? import sys; print sys.stdout.encoding
a. If not, can you change the console encoding? (chcp command on Windows)
Does the font you are using support the characters?
Saving the script in UTF-8, this works in both PythonWin and IDLE.
# coding: utf-8
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1
print jap_word2
Interestingly, I got your results with the .encode('utf-8') added to both prints in IDLE, but it worked correctly in Pythonwin, whose default output window supports UTF-8.
Idle is a strange beast. sys.stdout.encoding on my system produces 'cp1252', which doesn't support Asian characters, but it prints the first word wrong and the second one right when printing in UTF-8.
Because your console is not in UTF-8. Run chcp 65001 before running.
