Unicode with cp1251 and utf-8 on windows

Unicode with cp1251 and utf-8 on windows - python

I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as result there is some trash code points were interpreted.
I am understand that 'юникод' is just a bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But there is a way to get correct output in terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.

I think I can understand what happened to you. The last line gave me the hint, that your trash codepoints confirmed. You try to display cp1251 characters but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- is only used by the Python interpretor to convert characters from source python files that are outside of the ASCII range. And anyway it it is only used for unicode litterals because bytes from original source give er... exactly same bytes in byte strings. Some text editors are kind enough to automagically use this line (IDLE editor is), but I'm little confident in that and allways switch manually to the proper encoding when I use gvim for example. Short story: # -*- coding: cp1251 -*- in unused in your code and can only mislead a reader since it it not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, this word юникод is composed by those characters: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you find comfortable with you current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give same results - but now the declared source encoding should be the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'

Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), unicode(). If you are using a literal to create a string; you should use Unicode literals if the string contains text. It is the only option on Python 3 where you can't put non-ascii characters inside a bytes literal and it is a good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.

Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally1 detect the terminal encoding and encode the print output correctly.
1Unless your terminal is misconfigured.

Related

Why I cannot save file with Chinese characters when using Python 2.7.11 IDLE?

I just downloaded the latest Python 2.7.11 64bit from its official website and installed it to my Windows 10. And I found that if the new IDLE file contains Chinese character, like 你好, then I cannot save the file. If I tried to save it for several times, then the new file crashed and disappeared.
I also installed the latest python-3.5.1-amd64.exe, and it does not have this issue.
How to solve it?
More：
A example code from wiki page, https://zh.wikipedia.org/wiki/%E9%B8%AD%E5%AD%90%E7%B1%BB%E5%9E%8B
If I past the code here, StackOverflow alays warn me: Body cannot contain "I just dow". Why?
Thanks!
More:
I find this config option, but it does not help at all.
IDLE -> Options -> Configure IDLE -> General -> Default Source Encoding: UTF-8
More:
By adding u before the Chinese code, everything will be right, it is great way. Like below:
Without u there, sometimes it will go with corrupted code. Like below:

Python 2.x uses ASCII as default encoding, while Python 3.x uses UTF-8. Just use:
my_string.encode("utf-8")
to convert ascii to utf-8 (or change it to any other encoding you need)
You can also try to put this line on the first line of your code:
# -*- coding: utf-8 -*-

Python 2 uses ASCII as its default encoding for its strings which cannot store Chinese characters. On the other hand, Python 3 uses Unicode encoding for its strings by default which can store Chinese characters.
But that doesn't mean Python 2 cannot use Unicode strings. You just have to encode your strings into Unicode. Here's an example of converting your strings to Unicode strings.
>>> plain_text = "Plain text"
>>> plain_text
'Plain text'
>>> utf8_text = unicode(plain_text, "utf-8")
>>> utf8_txt
u'Plain_text'
The prefix u in the string, utf8_txt, says that it is a Unicode string.
You could also do this.
>>> print u"你好"
>>> 你好
You just have to prepend your string with u to signify that it is a Unicode string.

When using Python 2 on Windows:
For file with Unicode characters to be saved in IDLE, a line
# -*- coding: utf-8 -*-
has to be added in its beginning.
And for Unicode characters to show correctly in console output in Windows, if running a script, saved in a file, in IDLE console or in Windows shell, strings have to be prepended with u:
print u"你好"
print u"Привет"
But in interactive mode, I discovered no need for this with cyrillic.

even in python 3.7 I still experience the same issue, UTF-8 still does the trick

How to print symbols like ● to files in Python

I'm trying to write the symbol ● to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()
Instead of the black "●" I get "â—". How can I fix this?

Open the file using the io package for this to work with both python2 and python3 with encoding set to utf8 for this to work. When printing, When writing, write as a unicode string.
import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()
Tested on Python 2.7.8 and Python 3.4.2

If you are using Python 2, use codecs.open instead of open and unicode instead of str:
# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()
In Python 3, pass the encoding keyword argument to open:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()

>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "w").write(ec.encode('UTF-8'))

This should do the trick
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()
have a look at this

What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.
This does not mean that your program works for everybody.
For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:
# coding=utf-8 # Put your editor's encoding here
import codecs
import locale
import sys
# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
or locale.getpreferredencoding()
or sys.stdin.encoding or sys.stdout.encoding
# Default:
or "UTF8")
outFile = codecs.open('./myFile.txt', 'w', out_encoding)
Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.
If you know the encoding you want for your output file, you can directly put it in open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file—assuming their computer's encoding can represent it).
Then you must print a string, not bytes:
outFile.write(u"●")
(note the leading u, meaning "unicode string").
For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.

I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply non-sense.
It may not be evident at first sight, but text files are indeed encoded and may be encoded in different ways. If you have only letters (upper and lower case, but not accented oned), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because ASCII 7 bits is now a standard and in fact those characters have same representation in major encodings.
But as soon as you get true symbols, or accented chars, their representation vary from one encoding to the other. For example, the symbol ● has a UTF-8 representation of (Python coding) : \xe2\x97\x8f. What is worse, it cannot be represented in latin1 (ISO-8859-1) encoding.
Another example is the french e accent aigu : é it is represented in UTF8 as \xc3\xa9 (note 2 bytes), but is represented in Latin1 as \x89 (one single byte)
So I tested your code in my Ubuntu box using a UTF8 encoding and the command
cat myFile.txt ... correctly showed the bullet !
sba#sba-ubuntu:~/stackoverflow$ cat myFile.txt
●sba#sba-ubuntu:~/stackoverflow$
(as you didn't add any newline after the bullet, the prompt immediately follows it)
In conclusion :
Your code correctly writes the bullet to the file in UTF8 encoding. If your system uses natively another encoding (ISO-8859-1 or its variant Windows-1252) you cannot natively convert it because this character simply does not exist in this encodings.
But you can always see it in a text editor that supports different encoding like the excellent vim that exists on all major systems.
Proof of above :
On a Windows 7 computer, I opened a vim window and instructed it to accept utf8 with :set encoding='utf8'. I then pasted original code from OP and saved it to a file foo.py.
I opened a cmd.exe window and executed python foo.py (using a Python 2.7) : it created a file myFile.txt containing the 3 bytes (hexa) : e2 97 8f that is the utf8 representation of the bullet ● (I could confirm it with vim Tools/Hexa convert).
I could even open myFile.txt in idle and actually saw the bullet. Even notepad.exe could show the bullet !
So even on a Windows 7 computer that does not natively accept utf-8, the code from OP correctly generates a text file that when opened with a text editor accepting UTF-8 contains the bullet ●.
Of course, if I try to open myFile.txt with vim in latin1 mode, I get : â—, on a cmd windows with codepage 850, type myFile.txt shows ÔùÅ, and with codepage 1252 (variant of latin1) : â—.
In conclusion original OP code creates a correct utf8 encoded file - it is up to the reading part to interpret correctly utf8.

What is the default encoding method for code assumed by Python interpreter?

Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, it is said the default encoding method that Python interpreter assumes is ASCII. I have dealt with strings using non-ASCII characters in my Python code, without declaring encoding method of my code, and I don't remember I have bumped into encoding error before. What is the default encoding for code assumed by Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding method by Python interpreter changes if the above changes?
Thanks.

Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumened for source code will be directly dependent of the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your programs data, during runtime
As part of your source code (and since you can't have non-ASCII characters in identifiers, that usually means hard coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
for line in f:
# do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (different from the default source encoding, but happens to also be ascii). So to explicitely decode the string using utf-8 you'd need to do line.decode('utf-8').
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice however would be to generally avoid putting non-ASCII characters in your source code, exactly because if it depends on your and your co-workers editor and terminal settings wheter it will be written and read correctly.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'

By default, Python source files are treated as encoded in UTF-8. In that encoding, — although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. In order to do this, we put the below code on top of our code !
# -*- coding: encoding -*-
https://docs.python.org/dev/tutorial/interpreter.html

Unicode error when printing from Python to Heroku logs

I have a python script that's running periodically on Heroku using their Scheduler add-on. It prints some debug info, but when there's a non-ASCII character in the text, I get an error in the logs like:
SyntaxError: Non-ASCII character '\xc2' in file send-tweet.py on line 40, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
That's when I have a line like this in the script:
print u"Unicode test: £ ’ …"
I'm not sure what to do about this. If I have this in the script:
import locale
print u"Encoding: %s" % locale.getdefaultlocale()[1]
then this is output in the logs:
Encoding: UTF-8
So, why is it trying, and failing, to output other text in ASCII?
UPDATE: FWIW, here's the actual script I'm using. The debugging output's in line 38-39.

As the error says:
no encoding declared
i.e there is no encoding declared in your Python source file.
The linked PEP tells you how to declare an encoding in your Python source: the encoding should be set to the table that your editor/IDE uses when you input the unicode character £ from your example. Most likely UTF-8 is assumed, so at the first line of your send-tweet.py put this:
# coding=utf-8
If the first line already contains a path directive like:
#!/usr/local/bin/python
then put the encoding directive on the second line, e.g.
#!/usr/local/bin/python
# coding=utf-8
Also, when writing Unicode characters in your Python source and declaring UTF-8 encoding, you must use an editor with UTF-8 file saving support, i.e. an editor that can serialize Unicode code points to UTF-8.
In this regard, please note that Unicode and UTF-8 are not the same. Unicode refers to the standard, while UTF-8 is a specific encoding that determines how to serialize Unicode code points into a string that is compatible with ASCII and which uses 1 to 4 bytes to represent the original Unicode string.
So in the Python interpreter a string might be stored as Unicode, but if you want to write a Unicode string as UTF-8 you need to explicitly serialize the string to UTF-8 first, e.g.
s.encode("utf-8")
This is important especially when outputting Unicode strings to byte-sized streams, e.g. when writing to a log file handle which typically assumes byte-sized characters, i.e. UTF-8 for content that contains non-ASCII characters.

Writing UTF-8 friendly parsers in python

I wrote a simple file parser and writer, but then I came across an article talking about the importance of unicode and then it occurred to me that I'm assuming the input file is ascii encoded, which may not be the case all the time, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.
Furthermore, would I have to treat ascii and UTF-8 files separately and write different functions for each? I have not worked with anything other than ascii files yet and only read about handling unicode.

Python natively supports Unicode. If you directly read and write from the first file to the second, then no data is lost as it copies the bytes verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.

If you are using Python 2, you can simply change all your str objects to unicode objects. Unicode objects have all the same methods as strings but are encoded in a unicode format instead of ASCII. See http://docs.python.org/library/functions.html#unicode .
If you are using Python 3, strings are encoded in UTF-8 by default.

If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument which should be set to 'utf-8' in your case. When you read or write the returned file objects, string are automatically en-/decoded.
Anyway, you don't need to do something special for ASCII, because UTF-8 is a superset of ASCII.

So long as you are only reading and writing to files and not expecting any other type of encoded input, then you should not have to do anything special.
% cat /tmp/u
π is 3.14.
% file /tmp/u
/tmp/u: UTF-8 Unicode text
% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()
% python f.py
['\xcf\x80', 'is', '3.14.']
This changes when you declare or accept standard input using UTF-8.
% cat g.py
s = 'π is 3.14.'
print s.split()
% python g.py
File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).
% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()
% python h.py
['\xcf\x80', 'is', '3.14.']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.