Unicode output in Python's stdout when running from cmd.exe [duplicate]

This question already has answers here:
Python, Unicode, and the Windows console
(15 answers)
Closed 9 years ago.
I am running Windows 7, and its console is configured to use the Consolas font, which makes Unicode output possible. I have confirmed many times that the console can display Unicode in programs such as Far Manager: both Cyrillic and German äöü letters can be read on the same console, in the same string, without switching encodings.
Now about Python.
I have tried hard, but I can't get Unicode in its output.
By default, print(sys.stdout.encoding) prints cp866, and stdout is unable to output any characters except ASCII and Cyrillic.
It gives me the following results:
print("Ля-ля äöüÄÖÜß")
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6-12: character maps to <undefined>
print("Ля-ля äöüÄÖÜß".encode("utf-8"))
b'\xd0\x9b\xd1\x8f-\xd0\xbb\xd1\x8f \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
OK, I've set the PYTHONIOENCODING environment variable in a batch file:
SET PYTHONIOENCODING=UTF-8
and got:
print(sys.stdout.encoding)
UTF-8
print("Ля-ля äöüÄÖÜß")
╨Ы╤П-╨╗╤П ├д├╢├╝├Д├Ц├Ь├Я
print("Ля-ля äöüÄÖÜß".encode("utf-8"))
b'\xd0\x9b\xd1\x8f-\xd0\xbb\xd1\x8f \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
What to do?

Actually, there's a kinda bug in interaction between Python and Windows console (see http://bugs.python.org/issue1602). It is possible to read and write Unicode in Windows console using C functions ReadConsoleW, WriteConsoleW instead of ReadConsole and WriteConsole. So one seems-to-be-working solution is to write your own stdout and stdin object, calling ReadConsoleW, WriteConsoleW via ctypes. For output this works, but for input there's a problem that Python interactive interpreter actually doesn't use sys.stdin for getting input (but calling input() function works) – see http://bugs.python.org/issue17620.
Many people say that there's a problem with the Windows console. But you can actually type Unicode characters (if you have the proper keyboard layout) with no problem, and they are displayed with no problem. You can even run a file called “∫.py” with some Unicode arguments: it runs correctly and the arguments arrive correctly in the sys.argv strings.
Update: I have built a Python package to deal with these issues. See https://github.com/Drekin/win-unicode-console and https://pypi.python.org/pypi/win_unicode_console. Install by pip install win_unicode_console. It works at least for me on Python 3.4, Python 3.5, and Python 2.7.

Related

Python3 utf-8 decode issue

The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same code in an Ubuntu-based Docker container results in:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable UTF-8 decoding?

It seems Ubuntu, depending on the version, uses one encoding or another as the default, and it may vary between the shell and Python as well. Adapted from this posting and also this blog:
Thus the recommended way seems to be to tell your Python instance to use UTF-8 as the default encoding:
Set the encoding Python uses for stdin, stdout, and stderr via an environment variable:
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer explicitly, so it should work irrespective of the environment setting (see this question + answer, the Python docs, and PEP 263):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of encoding of files read by python, you can specify it explicitly in the open command
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way with some side effects, which saves you from specifying it explicitly each time (Python 2 only; sys.setdefaultencoding() does not exist in Python 3):
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.

The problem is with the print() call, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
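The encode step that print() performs can be reproduced without a terminal. The sketch below is an illustration only: it swaps the real terminal for an in-memory io.BytesIO wrapped in io.TextIOWrapper, and the helper name print_to_bytes is made up for this example.

```python
import io


def print_to_bytes(text, encoding):
    """Print `text` to an in-memory stream and return the bytes produced.

    The BytesIO object stands in for the terminal; TextIOWrapper plays the
    role of sys.stdout and performs the str -> bytes encoding.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


# With a UTF-8 stream, "é" becomes two bytes followed by the newline.
utf8_bytes = print_to_bytes("é", "utf-8")

# With an ASCII stream, the same call fails while *encoding*, not decoding.
try:
    print_to_bytes("é", "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

The decode() call from the question never enters the picture here, which matches the UnicodeEncodeError seen in the Docker container.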
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, this is a bit of a trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in start-up script for the shell your terminal runs, eg. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.
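On Python 3.7 and later, the re-encoding option can also be done in place with the stream's reconfigure() method instead of reopening the file descriptor. A minimal sketch, assuming the stream is an io.TextIOWrapper (which is what sys.stdout normally is); the helper name force_utf8 is hypothetical:

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the stream was reconfigured, False otherwise
    (for example when stdout has been replaced by a non-TextIOWrapper object).
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8")
    return True


# Demonstration on an in-memory stream that starts out as ASCII.
raw = io.BytesIO()
demo = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
changed = force_utf8(demo)
print("é", file=demo)
demo.flush()
demo_bytes = raw.getvalue()

# In a real script the call would be: force_utf8(sys.stdout)
```

As with the other options, this only helps if the terminal really expects UTF-8.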

Where does Python get the preferred encoding from? [duplicate]

When I try to print a Unicode string in a Windows console, I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character ....
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Edit: I'm using Python 2.5.
Note: @LasseV.Karlsen's answer with the checkmark is sort of outdated (from 2008). Please use the solutions/answers/suggestions below with care!!
@JFSebastian's answer is more relevant as of today (6 Jan 2016).

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.
I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.
The error means that the Unicode characters you are trying to print can't be represented using the current (chcp) console character encoding. The code page is often an 8-bit encoding such as cp437, which can represent only ~0x100 of the ~1M Unicode characters:
>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to <undefined>
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
The Windows console does accept Unicode characters, and it can even display them (BMP only) if the corresponding font is configured. The WriteConsoleW() API should be used, as suggested in @Daira Hopwood's answer. It can be called transparently, i.e., you don't need to and should not modify your scripts if you use the win-unicode-console package:
T:\> py -m pip install win-unicode-console
T:\> py -m run your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
Is there any way I can make Python automatically print a ? instead of failing in this situation?
If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:
T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]
In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.
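The ":replace" part of PYTHONIOENCODING selects an error handler for the standard streams. What such a handler does can be seen in isolation with str.encode(); the sketch below uses cp437 only as an example code page and is not tied to any particular console:

```python
text = "[\N{EURO SIGN}]"

# 'strict' is the default handler and raises UnicodeEncodeError for the euro sign.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes "?" for every character the code page lacks.
replaced = text.encode("cp437", errors="replace")

# 'backslashreplace' keeps the information as an escape sequence instead.
escaped = text.encode("cp437", errors="backslashreplace")
```

The backslashreplace handler can be a better fit for diagnostic output, since the original code point stays recoverable.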

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.

Update: On Python 3.6 or later, printing Unicode strings to the console on Windows just works.
So, upgrade to recent Python and you're done. At this point I recommend using 2to3 to update your code to Python 3.x if needed, and just dropping support for Python 2.x. Note that there has been no security support for any version of Python before 3.7 (including Python 2.7) since December 2021.
If you really still need to support earlier versions of Python (including Python 2.7), you can use https://github.com/Drekin/win-unicode-console , which is based on, and uses the same APIs as the code in the answer that was previously linked here. (That link does include some information on Windows font configuration but I doubt it still applies to Windows 8 or later.)
Note: despite other plausible-sounding answers that suggest changing the code page to 65001, that did not work prior to Python 3.8. (It does kind-of work since then, but as pointed out above, you don't need to do so for Python 3.6+ anyway.) Also, changing the default encoding using sys.setdefaultencoding is (still) not a good idea.

If you're not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):
from __future__ import print_function
import sys
def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")
The bad character(s) in the string will be converted into a representation that is printable by the Windows console.

The code below makes Python output to the console as UTF-8, even on Windows.
The console displays the characters well on Windows 7 but not on Windows XP; still, it works and, most importantly, you get consistent output from your script on all platforms. You'll be able to redirect the output to a file.
The code below was tested with Python 2.6 on Windows.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
if sys.platform == 'win32':
    try:
        import win32console
    except:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        exit(-1)
    # win32console implementation of SetConsoleCP does not return a value
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")
#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

Just enter this code in command line before executing python script:
chcp 65001 & set PYTHONIOENCODING=utf-8
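The effect of PYTHONIOENCODING on its own can be checked without relying on the console at all, by starting a child interpreter with its output piped and looking at the raw bytes. A rough sketch, assuming a Python 3 interpreter is available as sys.executable; the helper name child_output is made up here:

```python
import os
import subprocess
import sys


def child_output(io_encoding):
    """Run a child Python that prints U+221E and return its raw stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "print('\\u221e')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Strip the trailing newline (which may be \r\n on Windows).
    return result.stdout.strip()


utf8_output = child_output("utf-8")           # UTF-8 bytes for the infinity sign
ascii_output = child_output("ascii:replace")  # the character degrades to "?"
```

Whether those bytes are then displayed correctly is a separate matter that depends on the console's code page and font.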

Like Giampaolo Rodolà's answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles.
For the moment I just wanted something which would mean my program would NOT CRASH, and which I understood ... and also which didn't involve importing too many exotic modules (in particular I'm using Jython, so half the time a Python module turns out not in fact to be available).
def pr(s):
    try:
        print(s)
    except UnicodeEncodeError:
        for c in s:
            try:
                print(c, end='')
            except UnicodeEncodeError:
                print('?', end='')
NB "pr" is shorter to type than "print" (and quite a bit shorter to type than "safeprint")...!

Related to the answer by J. F. Sebastian, but more direct.
If you are having this problem when printing to the console/terminal, then do this:
>set PYTHONIOENCODING=UTF-8

For Python 2 try:
print unicode(string, 'unicode-escape')
For Python 3 try:
import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)
Or try win-unicode-console:
pip install win-unicode-console
py -mrun your_script.py

TL;DR:
print(yourstring.encode('ascii','replace').decode('ascii'))
I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)
I wanted to parse chat messages in order to respond...
msg = s.recv(1024).decode("utf-8")
but also print them safely to the console in a human-readable format:
print(msg.encode('ascii','replace').decode('ascii'))
This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.

Python 3.6, Windows 7: There are several ways to launch Python: you can use the Python console (which has a Python logo on it) or the Windows console (it says cmd.exe on it).
I could not print UTF-8 characters in the Windows console. Printing UTF-8 characters threw this error:
OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect
After trying and failing to understand the answer above, I discovered it was only a settings problem. Right-click the title bar of the cmd console window and, on the Font tab, choose Lucida Console.

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:
import sys
sys.getdefaultencoding()
if it says ascii, there's your cause ;-)
You have to create a file called sitecustomize.py and put it on the Python path (I put it under /usr/lib/python2.5/site-packages, but that is different on Windows: it is c:\python\lib\site-packages or something), with the following contents:
import sys
sys.setdefaultencoding('utf-8')
and perhaps you might want to specify the encoding in your files as well:
# -*- coding: UTF-8 -*-
import sys,time
Edit: more info can be found in the excellent Dive into Python book

Nowadays, the Windows console does not encounter this error, unless you redirect the output.
Here is an example Python script scratch_1.py:
s = "∞"
print(s)
If you run the script as follows, everything works as intended:
python scratch_1.py
∞
However, if you run the following, then you get the same error as in the question:
python scratch_1.py > temp.txt
Traceback (most recent call last):
File "C:\Users\Wok\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\scratch_1.py", line 3, in <module>
print(s)
File "C:\Users\Wok\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 0: character maps to <undefined>
To solve this issue with the suggestion present in the original question, i.e. by replacing the erroneous characters with question marks ?, one can proceed as follows:
s = "∞"
try:
    print(s)
except UnicodeEncodeError:
    output_str = s.encode("ascii", errors="replace").decode("ascii")
    print(output_str)
It is important:
to call decode(), so that the type of the output is str instead of bytes,
with the same encoding, here "ascii", to avoid the creation of mojibake.
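A variation on the fallback above, offered as a sketch rather than a drop-in fix: encoding with the stream's own encoding instead of "ascii" keeps every character the target can represent and replaces only the rest. The function names to_printable and safe_print are hypothetical:

```python
import sys


def to_printable(text, encoding):
    """Return `text` with characters that `encoding` cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def safe_print(text, stream=None):
    """Print `text`, degrading unencodable characters instead of raising."""
    stream = sys.stdout if stream is None else stream
    # Some replacement streams have no encoding attribute or set it to None.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    print(to_printable(text, encoding), file=stream)


# With cp1252, "é" survives and only the infinity sign is replaced.
example = to_printable("é ∞", "cp1252")
```

With "ascii" as the target, both non-ASCII characters in that example would become question marks.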

James Sulak asked,
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Other solutions recommend we attempt to modify the Windows environment or replace Python's print() function. The answer below comes closer to fulfilling Sulak's request.
Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:
In place of:
print(text)
substitute:
print(str(text).encode('utf-8'))
Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:
Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir
Instead of
Halmalo n’était plus qu’un point noir
Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.
Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.

The issue is that the default encoding on Windows is set to cp1252 and needs to be set to UTF-8. (check PEP)
Check default encoding using:
import locale
locale.getpreferredencoding()
You can override locale settings
import os
if os.name == "nt":
    import _locale
    _locale._gdl_bak = _locale._getdefaultlocale
    _locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))
referenced code from stack link
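Before overriding anything, it can help to see which encodings the interpreter has actually picked. A small diagnostic sketch using only standard-library calls; the function name encoding_report is an assumption of this example:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is currently using, keyed by where they apply."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "locale preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "str default": sys.getdefaultencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

If the "stdout" entry differs between a plain run and a run with redirected output, the redirect is going through the locale encoding rather than the console's Unicode path.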

chcp 65001 codepage results in program termination without any error

Problem
The problem arises when I want to input a Unicode character in the Python interpreter (for simplicity I have used a-umlaut in the example, but I first encountered this with Farsi characters). Whenever I use Python with the chcp 65001 code page and then try to input even one Unicode character, Python exits without any error.
I have spent days trying to solve this problem, to no avail. But today I found a thread on the Python website, another on MySQL, and another on Lua-users in which issues were raised regarding this sudden exit, although without any solution, and with some saying that chcp 65001 is inherently broken.
It would be good to know once and for all whether this problem is related to the design of chcp or whether there is a possible workaround.
Reproduce Error
chcp 65001
Python 3.X:
Python shell
print('ä')
result: it just exits the shell
however, this works python.exe -c "print('ä')"
and also this : print('\u00e4')
result: ä
in Luajit2.0.4
print('ä')
result: it just exits the shell
however this works: print('\xc3\xa4')
I have come up with these observations so far:
direct output with the command prompt works.
the Unicode-escape and hex-escape equivalents of the character work.
So
Is it correct that this is not a Python bug, and that we can't use a Unicode character directly in CLI programs in the Windows command prompt or any of its wrappers like ConEmu or Cmder? (I am using Cmder to be able to see and use Unicode characters in the Windows shell, and I have done so without any problem.)

To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.
Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.
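The mismatch between UTF-8 bytes and UTF-16 code units described above is easy to quantify. The sketch below only counts units in Python; it does not call any console API, and the function name unit_counts is made up for illustration:

```python
def unit_counts(text):
    """Return (UTF-8 byte count, UTF-16 code unit count) for `text`."""
    utf8_bytes = len(text.encode("utf-8"))
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


ascii_counts = unit_counts("print('a')")   # counts agree for pure ASCII
umlaut_counts = unit_counts("print('ä')")  # one extra UTF-8 byte for "ä"
```

A writer that sends 11 bytes and is told that 10 were written will assume one byte is still pending, which is consistent with the trail of repeated output described for the older console.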

Convert unicode escape sequence into chinease characters string [duplicate]

When I try to print a Unicode string in a Windows console, I get an error .
UnicodeEncodeError: 'charmap' codec can't encode character ....
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Edit: I'm using Python 2.5.
Note: #LasseV.Karlsen answer with the checkmark is sort of outdated (from 2008). Please use the solutions/answers/suggestions below with care!!
#JFSebastian answer is more relevant as of today (6 Jan 2016).

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.
I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.
The error means that Unicode characters that you are trying to print can't be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:
>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in #Daira Hopwood's answer. It can be called transparently i.e., you don't need to and should not modify your scripts if you use win-unicode-console package:
T:\> py -m pip install win-unicode-console
T:\> py -m run your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
Is there any way I can make Python
automatically print a ? instead of failing in this situation?
If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:
T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]
In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.

Update: On Python 3.6 or later, printing Unicode strings to the console on Windows just works.
So, upgrade to recent Python and you're done. At this point I recommend using 2to3 to update your code to Python 3.x if needed, and just dropping support for Python 2.x. Note that there has been no security support for any version of Python before 3.7 (including Python 2.7) since December 2021.
If you really still need to support earlier versions of Python (including Python 2.7), you can use https://github.com/Drekin/win-unicode-console , which is based on, and uses the same APIs as the code in the answer that was previously linked here. (That link does include some information on Windows font configuration but I doubt it still applies to Windows 8 or later.)
Note: despite other plausible-sounding answers that suggest changing the code page to 65001, that did not work prior to Python 3.8. (It does kind-of work since then, but as pointed out above, you don't need to do so for Python 3.6+ anyway.) Also, changing the default encoding using sys.setdefaultencoding is (still) not a good idea.

If you're not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):
from __future__ import print_function
import sys
def safeprint(s):
try:
print(s)
except UnicodeEncodeError:
if sys.version_info >= (3,):
print(s.encode('utf8').decode(sys.stdout.encoding))
else:
print(s.encode('utf8'))
safeprint(u"\N{EM DASH}")
The bad character(s) in the string will be converted in a representation which is printable by the Windows console.

The below code will make Python output to console as UTF-8 even on Windows.
The console will display the characters well on Windows 7 but on Windows XP it will not display them well, but at least it will work and most important you will have a consistent output from your script on all platforms. You'll be able to redirect the output to a file.
Below code was tested with Python 2.6 on Windows.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
if sys.platform == 'win32':
try:
import win32console
except:
print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
exit(-1)
# win32console implementation of SetConsoleCP does not return a value
# CP_UTF8 = 65001
win32console.SetConsoleCP(65001)
if (win32console.GetConsoleCP() != 65001):
raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
win32console.SetConsoleOutputCP(65001)
if (win32console.GetConsoleOutputCP() != 65001):
raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")
#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

Just enter this code in command line before executing python script:
chcp 65001 & set PYTHONIOENCODING=utf-8

Like Giampaolo Rodolà's answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles,
For the moment I just wanted sthg which would mean my program would NOT CRASH, and which I understood ... and also which didn't involve importing too many exotic modules (in particular I'm using Jython, so half the time a Python module turns out not in fact to be available).
def pr(s):
try:
print(s)
except UnicodeEncodeError:
for c in s:
try:
print( c, end='')
except UnicodeEncodeError:
print( '?', end='')
NB "pr" is shorter to type than "print" (and quite a bit shorter to type than "safeprint")...!

Kind of related on the answer by J. F. Sebastian, but more direct.
If you are having this problem when printing to the console/terminal, then do this:
>set PYTHONIOENCODING=UTF-8

For Python 2 try:
print unicode(string, 'unicode-escape')
For Python 3 try:
import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)
Or try win-unicode-console:
pip install win-unicode-console
py -mrun your_script.py

TL;DR:
print(yourstring.encode('ascii','replace').decode('ascii'))
I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)
I wanted to parse chat messages in order to respond...
msg = s.recv(1024).decode("utf-8")
but also print them safely to the console in a human-readable format:
print(msg.encode('ascii','replace').decode('ascii'))
This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.

Python 3.6 windows7: There is several way to launch a python you could use the python console (which has a python logo on it) or the windows console (it's written cmd.exe on it).
I could not print utf8 characters in the windows console. Printing utf-8 characters throw me this error:
OSError: [winError 87] The paraneter is incorrect
Exception ignored in: (_io-TextIOwrapper name='(stdout)' mode='w' ' encoding='utf8')
OSError: [WinError 87] The parameter is incorrect
After trying and failing to understand the answer above I discovered it was only a setting problem. Right click on the top of the cmd console windows, on the tab font chose lucida console.

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:
import sys
sys.getdefaultencoding()
If it says ascii, there's your cause ;-)
You have to create a file called sitecustomize.py and put it on the Python path (I put it under /usr/lib/python2.5/site-packages, but that is different on Windows: it is c:\python\lib\site-packages or something similar), with the following contents. Note that this applies to Python 2 only; sys.setdefaultencoding does not exist in Python 3.
import sys
sys.setdefaultencoding('utf-8')
and perhaps you might want to specify the encoding in your files as well:
# -*- coding: UTF-8 -*-
import sys,time
Edit: more info can be found in the excellent Dive into Python book.

Nowadays, the Windows console does not encounter this error, unless you redirect the output.
Here is an example Python script scratch_1.py:
s = "∞"
print(s)
If you run the script as follows, everything works as intended:
python scratch_1.py
∞
However, if you run the following, then you get the same error as in the question:
python scratch_1.py > temp.txt
Traceback (most recent call last):
  File "C:\Users\Wok\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\scratch_1.py", line 3, in <module>
    print(s)
  File "C:\Users\Wok\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 0: character maps to <undefined>
To solve this issue with the suggestion present in the original question, i.e. by replacing the erroneous characters with question marks ?, one can proceed as follows:
s = "∞"
try:
    print(s)
except UnicodeEncodeError:
    output_str = s.encode("ascii", errors="replace").decode("ascii")
    print(output_str)
It is important to call decode(), so that the type of the output is str instead of bytes, and to decode with the same encoding that was used for encoding (here "ascii"), to avoid creating mojibake.
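A variation on the same idea, as a minimal sketch rather than a definitive fix: instead of always falling back to "ascii", the round trip can use the encoding of the output stream itself, so that characters the stream can represent are kept and only the rest become ?. The helper name to_printable is hypothetical, and the fallback to "ascii" when the stream reports no encoding is an assumption.

```python
import sys


def to_printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    to_printable is a hypothetical helper name. If no encoding is given,
    the encoding reported by sys.stdout is used, falling back to "ascii".
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode and decode with the SAME encoding to avoid mojibake.
    return text.encode(encoding, errors="replace").decode(encoding)


# Safe regardless of where stdout points (console or redirected file).
print(to_printable("caf\u00e9 \u221e"))
```

With an explicit encoding, for instance to_printable("caf\u00e9 \u221e", "cp1252"), the é survives because cp1252 contains it, while ∞ is replaced.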

James Sulak asked,
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Other solutions recommend we attempt to modify the Windows environment or replace Python's print() function. The answer below comes closer to fulfilling Sulak's request.
Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:
In place of:
print(text)
substitute:
print(str(text).encode('utf-8'))
Instead of throwing an exception, Python now prints the bytes representation of the string, in which non-ASCII characters appear as \xNN hex codes, e.g.:
Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir
Instead of
Halmalo n’était plus qu’un point noir
Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.
Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.

The issue is that the Windows default encoding is set to cp1252 and needs to be set to utf-8. (check PEP)
Check default encoding using:
import locale
locale.getpreferredencoding()
You can override the locale settings:
import os
if os.name == "nt":
    import _locale
    _locale._gdl_bak = _locale._getdefaultlocale
    _locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))
(Code referenced from a Stack Overflow answer.)
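Before overriding anything, it can help to see which encodings Python has actually picked up. The following is a small diagnostic sketch using only standard-library calls; the function name encoding_report is hypothetical.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that affect printing and file I/O."""
    return {
        # Encoding used by print(); None if stdout has no such attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale-based default used by open() when no encoding is passed.
        "preferred": locale.getpreferredencoding(False),
        # Default for str.encode() / bytes.decode() without arguments.
        "default": sys.getdefaultencoding(),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

If "stdout" and "preferred" differ between a console run and a run with redirected output, that difference is the likely source of the UnicodeEncodeError.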

UnicodeEncodeError in Python on Windows Console

I'm having the following error while recursing the files in a directory and printing file names in the console:
Traceback (most recent call last):
  File "C:\Program Files\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 53: character maps to <undefined>
According to the error, one of the characters in the file name string is \u2013, which is an EN DASH (–), a different character from the commonly seen hyphen-minus (-).
I have checked my Windows encoding, which is set to 437. I see two options to work around this: either change the encoding of the Windows console, or convert the characters I get from the file names to suit the console encoding. How would I go about doing that in Python 3.3?

The Windows console is using the cp437 encoding, and the character \u2013 isn't supported by that encoding. Try adding this to your code:
import io, sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, 'cp437', 'backslashreplace')
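To see what the backslashreplace handler does without touching the real console, here is a small sketch that wraps an in-memory buffer instead of sys.stdout.buffer. The file name is made up for illustration.

```python
import io

# An in-memory byte buffer stands in for sys.stdout.buffer.
buffer = io.BytesIO()
out = io.TextIOWrapper(
    buffer,
    encoding="cp437",
    errors="backslashreplace",
    newline="\n",  # keep "\n" as-is so the result is the same on every OS
)

# U+2013 (EN DASH) is not in cp437, so it is written as the escape \u2013.
print("report \u2013 final.txt", file=out)
out.flush()

shown = buffer.getvalue().decode("cp437")
print(shown, end="")
```

The output contains the six literal characters \u2013 in place of the dash, so the name stays readable but is no longer the exact original string.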

or convert the characters I get from the file names to suit the console encoding
Probably the console encoding is already correct (can't tell from the error message though). Code page 437 simply doesn't include that character so you won't be able to print it.
You can reopen stdout with a text encoder that has a fallback error handler, as demonstrated in iamsudip's answer, which uses backslashreplace, to at least get readable (if not reliably recoverable) output instead of an error.
changing the encoding of Windows console
You can do this by executing the console command chcp 1252 before running Python, but that will still only give you a different limited repertoire of printable characters - including U+2013, but not many other Unicode characters.
In theory you can chcp to 65001 to get UTF-8 which would allow you to print any character. Unfortunately there are serious bugs in the C runtime's standard IO implementation, which usually make this unusable in practice.
This sorry state of affairs affects all applications that use the MS C runtime's stdio library calls, including Python and most other languages, with the result that Unicode on the Windows console just doesn't work in most cases.
If you really have to get Unicode out to the Windows console you can use the Win32 WriteConsoleW API directly using ctypes, but it's not much fun.
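For reference, a rough sketch of what calling WriteConsoleW through ctypes can look like. This is an illustration under assumptions, not a complete stdout replacement: the function name console_write is hypothetical, very long strings are not split into chunks, and when output is redirected to a file the call simply fails and the function returns False.

```python
import ctypes
import os

STD_OUTPUT_HANDLE = -11  # Win32 constant selecting the standard output handle


def console_write(text):
    """Write text with WriteConsoleW; return True only if the console took it."""
    if os.name != "nt":
        return False  # WriteConsoleW exists only on Windows

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE,                 # console output handle
        wintypes.LPCWSTR,                # text as UTF-16
        wintypes.DWORD,                  # number of UTF-16 code units
        ctypes.POINTER(wintypes.DWORD),  # receives the units written
        wintypes.LPVOID,                 # reserved, must be NULL
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    # The API counts UTF-16 code units, not Python code points.
    units = len(text.encode("utf-16-le", errors="surrogatepass")) // 2
    written = wintypes.DWORD(0)
    ok = kernel32.WriteConsoleW(handle, text, units, ctypes.byref(written), None)
    return bool(ok)


if not console_write("\u041b\u044f-\u043b\u044f \u00e4\u00f6\u00fc\n"):
    print("WriteConsoleW unavailable (not Windows, or output is redirected)")
```

A False return leaves the caller free to fall back to an ordinary print() with a replacing error handler.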
