UnicodeEncodeError if piping output to wc -l

UnicodeEncodeError if piping output to wc -l - python

When running the code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
import xml.etree.ElementTree as ET
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text
Produces the expected output vägen, however if piping this to wc -l I get a UnicodeEncodeError, e.g. (TEerr.py holds the code snippet given above):
:~> ETerr.py | wc -l
Traceback (most recent call last):
File "./ETerr.py", line 5, in <module>
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
0
:~>
How can the code behave differently if its output is piped or not and how can I fix it so that it doesn't.
Please note that the code snippet above is merely set up to demonstrate the issue with as little code as possible, in the actual script where I need to resolve the issue the xml is retrieved using urllib hence I have little control over its format.

First, let me point out that this is not a problem in Python 3, and fixing it is in fact one of the reasons that it was worth a compatibility-breaking change to the language in the first place. But I'll assume you have a good reason for using Python 2, and can't just upgrade.
The proximate cause here (assuming you're using Python 2.7 on a POSIX platform—things can be more complicated on older 2.x, and on Windows) is the value of sys.stdout.encoding. When you start up the interpreter, it does the equivalent of this pseudocode:
if isatty(stdoutfd):
sys.stdout.encoding = parse_locale(os.environ('LC_CTYPE'))
else:
sys.stdout.encoding = None
And every time you write to a file, including sys.stdout, including implicitly from a print statement, it does something like this:
if isinstance(s, unicode):
if self.encoding:
s = s.encode(self.encoding)
else:
s = s.encode(sys.getdefaultencoding())
The actual code does standard POSIX stuff looking for fallbacks like LANG, and hardcodes a fallback to UTF-8 in some cases for Mac OS X, etc., but this is close enough.
This is only sparsely documented, under file.encoding:
The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.
To verify that this is your problem, try the following:
$ python -c 'print __import__("sys").stdout.encoding'
UTF-8
$ python -c 'print __import__("sys").stdout.encoding' | cat
None
To be extra sure this is the problem:
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
Latin-1
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
Latin-1
So, how do you fix this?
Well, the obvious way is to upgrade to Python 3.6, where you'll get UTF-8 in both cases, but I'll assume there's a reason you're using Python 2.7 and can't easily change it.
The right solution is actually pretty complicated. But if you want a quick&dirty solution that works for your system, and for most current Linux and Mac systems with standard Python 2.7 setups (even though it may be disastrously wrong for older Linux systems, older Python 2.x versions, and weird setups), you can either:
Set the environment variable PYTHONIOENCODING to override the detection and force UTF-8. Setting this in your profile or similar may be worth doing if you know that every terminal and every tool you're ever going to use from this account is UTF-8, although it's a terrible idea if that isn't true.
Check sys.stdout.encoding and wrap it with a 'UTF-8' encoding if it's None.
Explicitly .encode('UTF-8') on everything you print.

Related

pprint: UnicodeEncodeError: 'ascii' codec can't encode character

This is driving me crazy. I'm trying to pprint a dict with an é char, and it throws me out.
I'm using Python 3:
from pprint import pprint
knights = {'gallahad': 'the pure', 'robin': 'the bravé'}
pprint (knights)
Error:
File "/data/prod_envs/pythons/python36/lib/python3.6/pprint.py", line 176, in _format
stream.write(rep)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 43: ordinal not in range(128)
I read up on the Python ASCII doc, but there does not seem a quick way to solve this, other than taking the dict apart, and rewriting the offending value to an ASCII value via .encode, and then re-assembling the dict again
Is there any way I can get this to print without taking the dict apart?

This is unrelated to pprint: the module only formats the string into another string and then passes the formatted string to the underlying stream. So your error occurs when the é character (U+00E9) is written to stdout.
Now it really depends on the underlying OS and the configuration of the Python interpreter. In Linux or other Unix-like systems, you could try to declare a UTF-8 or Latin1 charset in your terminal session by setting the environment variable PYTHONIOENCODING before starting Python:
$ export PYTHONIOENCODING=Latin1
$ python
(or use PYTHONIOENCODING=utf8 depending on the actual encoding of your terminal or terminal window).

Standard input and output are file objects in Python. The Python 3 documentation says that, when these objects are created, if encoding is left unspecified then locale.getpreferredencoding(False) is called to fetch the locale's preferred encoding.
Your system should have been set up with one or more "locales" when GNU/Linux was installed (I'm guessing from your paths that you are using some version of GNU/Linux). On a "sensible" setup, the default locale should allow UTF-8. But if you only did a "minimal" installation (for example as part of setting up a container), or something like that, then it is possible that the system has set locale to "C" (the ultimate fallback locale), which does not support UTF-8.
Just because your terminal can accept UTF-8 (as demonstrated by using echo with a UTF-8 string), does not mean Python knows that UTF-8 is acceptable. If Python sees the locale set to "C" then it will assume only ASCII is allowed unless told otherwise.
You can check the current locale by typing locale at the shell prompt, and change it by setting the LC_ALL environment variable. But before changing it you must check with locale -a to see which locales are available on your system, otherwise your change may not be effective and you may get the "C" locale anyway. If your system has not been set up with the locale you want, you can add it if you have root access: most GNU/Linux distributions provide options to do this when you (re)configure a package called locales, so for example on Debian/Ubuntu-based distros, sudo dpkg-reconfigure locales should show you the options.
But sometimes you will be in the awkward position of having to write a Python script to run on a system that has not been set up with decent locales and there's nothing you can do about it because you don't have root and the sysadmin insists on giving you the absolute minimum. Then what do we do?
Well there are options within Python itself. You could run export PYTHONIOENCODING=utf-8 before running Python, to tell Python to use that encoding no matter what the locale says. Or you could give pprint a stream= parameter, set to a stream that you've opened yourself using open() with an encoding="utf-8" parameter (although this is no good if you want to use sys.stdout or os.popen instead of a file). Or you could upgrade to Python 3.7 and use sys.stdout.reconfigure(encoding='utf-8') (but this won't work in the Python 3.6 mentioned in the original question).
Or, you could import codecs and do w=codecs.getwriter("utf-8")(sys.stdout.buffer) and then pass stream=w to your pprint:
from pprint import pprint
import sys, codecs
w=codecs.getwriter("utf-8")(sys.stdout.buffer)
d = {"testing": "这是个考验"}
pprint (d, stream=w)

Python3 utf-8 decode issue

The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu based docker container results in :
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable utf-8 decoding ?

Seems ubuntu - depending on version - uses one encoding or another as default, and it may vary between shell and python as well. Adopted from this posting and also this blog:
Thus the recommended way seems to be to tell your python instance to use utf-8 as default encoding:
Set your default encoding of python source files via environment variable:
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of environment setting (see this question + answer, python docs and PEP 263:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of encoding of files read by python, you can specify it explicitly in the open command
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way with some side effects, but saves you to explicitly specify it each time
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.

The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, this is a bit of a trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in start-up script for the shell your terminal runs, eg. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.

Where does Python get the preferred encoding from? [duplicate]

When I try to print a Unicode string in a Windows console, I get an error .
UnicodeEncodeError: 'charmap' codec can't encode character ....
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Edit: I'm using Python 2.5.
Note: #LasseV.Karlsen answer with the checkmark is sort of outdated (from 2008). Please use the solutions/answers/suggestions below with care!!
#JFSebastian answer is more relevant as of today (6 Jan 2016).

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.
I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.
The error means that Unicode characters that you are trying to print can't be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:
>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in #Daira Hopwood's answer. It can be called transparently i.e., you don't need to and should not modify your scripts if you use win-unicode-console package:
T:\> py -m pip install win-unicode-console
T:\> py -m run your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
Is there any way I can make Python
automatically print a ? instead of failing in this situation?
If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:
T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]
In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.

Update: On Python 3.6 or later, printing Unicode strings to the console on Windows just works.
So, upgrade to recent Python and you're done. At this point I recommend using 2to3 to update your code to Python 3.x if needed, and just dropping support for Python 2.x. Note that there has been no security support for any version of Python before 3.7 (including Python 2.7) since December 2021.
If you really still need to support earlier versions of Python (including Python 2.7), you can use https://github.com/Drekin/win-unicode-console , which is based on, and uses the same APIs as the code in the answer that was previously linked here. (That link does include some information on Windows font configuration but I doubt it still applies to Windows 8 or later.)
Note: despite other plausible-sounding answers that suggest changing the code page to 65001, that did not work prior to Python 3.8. (It does kind-of work since then, but as pointed out above, you don't need to do so for Python 3.6+ anyway.) Also, changing the default encoding using sys.setdefaultencoding is (still) not a good idea.

If you're not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):
from __future__ import print_function
import sys
def safeprint(s):
try:
print(s)
except UnicodeEncodeError:
if sys.version_info >= (3,):
print(s.encode('utf8').decode(sys.stdout.encoding))
else:
print(s.encode('utf8'))
safeprint(u"\N{EM DASH}")
The bad character(s) in the string will be converted in a representation which is printable by the Windows console.

The below code will make Python output to console as UTF-8 even on Windows.
The console will display the characters well on Windows 7 but on Windows XP it will not display them well, but at least it will work and most important you will have a consistent output from your script on all platforms. You'll be able to redirect the output to a file.
Below code was tested with Python 2.6 on Windows.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
if sys.platform == 'win32':
try:
import win32console
except:
print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
exit(-1)
# win32console implementation of SetConsoleCP does not return a value
# CP_UTF8 = 65001
win32console.SetConsoleCP(65001)
if (win32console.GetConsoleCP() != 65001):
raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
win32console.SetConsoleOutputCP(65001)
if (win32console.GetConsoleOutputCP() != 65001):
raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")
#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

Just enter this code in command line before executing python script:
chcp 65001 & set PYTHONIOENCODING=utf-8

Like Giampaolo Rodolà's answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles,
For the moment I just wanted sthg which would mean my program would NOT CRASH, and which I understood ... and also which didn't involve importing too many exotic modules (in particular I'm using Jython, so half the time a Python module turns out not in fact to be available).
def pr(s):
try:
print(s)
except UnicodeEncodeError:
for c in s:
try:
print( c, end='')
except UnicodeEncodeError:
print( '?', end='')
NB "pr" is shorter to type than "print" (and quite a bit shorter to type than "safeprint")...!

Kind of related on the answer by J. F. Sebastian, but more direct.
If you are having this problem when printing to the console/terminal, then do this:
>set PYTHONIOENCODING=UTF-8

For Python 2 try:
print unicode(string, 'unicode-escape')
For Python 3 try:
import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)
Or try win-unicode-console:
pip install win-unicode-console
py -mrun your_script.py

TL;DR:
print(yourstring.encode('ascii','replace').decode('ascii'))
I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)
I wanted to parse chat messages in order to respond...
msg = s.recv(1024).decode("utf-8")
but also print them safely to the console in a human-readable format:
print(msg.encode('ascii','replace').decode('ascii'))
This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.

Python 3.6 windows7: There is several way to launch a python you could use the python console (which has a python logo on it) or the windows console (it's written cmd.exe on it).
I could not print utf8 characters in the windows console. Printing utf-8 characters throw me this error:
OSError: [winError 87] The paraneter is incorrect
Exception ignored in: (_io-TextIOwrapper name='(stdout)' mode='w' ' encoding='utf8')
OSError: [WinError 87] The parameter is incorrect
After trying and failing to understand the answer above I discovered it was only a setting problem. Right click on the top of the cmd console windows, on the tab font chose lucida console.

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:
import sys
sys.getdefaultencoding()
if it says ascii, there's your cause ;-)
You have to create a file called sitecustomize.py and put it under python path (I put it under /usr/lib/python2.5/site-packages, but that is differen on Win - it is c:\python\lib\site-packages or something), with the following contents:
import sys
sys.setdefaultencoding('utf-8')
and perhaps you might want to specify the encoding in your files as well:
# -*- coding: UTF-8 -*-
import sys,time
Edit: more info can be found in excellent the Dive into Python book

Nowadays, the Windows console does not encounter this error, unless you redirect the output.
Here is an example Python script scratch_1.py:
s = "∞"
print(s)
If you run the script as follows, everything works as intended:
python scratch_1.py
∞
However, if you run the following, then you get the same error as in the question:
python scratch_1.py > temp.txt
Traceback (most recent call last):
File "C:\Users\Wok\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\scratch_1.py", line 3, in <module>
print(s)
File "C:\Users\Wok\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 0: character maps to <undefined>
To solve this issue with the suggestion present in the original question, i.e. by replacing the erroneous characters with question marks ?, one can proceed as follows:
s = "∞"
try:
print(s)
except UnicodeEncodeError:
output_str = s.encode("ascii", errors="replace").decode("ascii")
print(output_str)
It is important:
to call decode(), so that the type of the output is str instead of bytes,
with the same encoding, here "ascii", to avoid the creation of mojibake.

James Sulak asked,
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Other solutions recommend we attempt to modify the Windows environment or replace Python's print() function. The answer below comes closer to fulfilling Sulak's request.
Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:
In place of:
print(text)
substitute:
print(str(text).encode('utf-8'))
Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:
Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir
Instead of
Halmalo n’était plus qu’un point noir
Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.
Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.

The issue is with windows default encoding being set to cp1252, and need to be set to utf-8. (check PEP)
Check default encoding using:
import locale
locale.getpreferredencoding()
You can override locale settings
import os
if os.name == "nt":
import _locale
_locale._gdl_bak = _locale._getdefaultlocale
_locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))
referenced code from stack link

python 2.7 ignores default encoding set in sitecustomize.py when parsing scripts

I’m having problems getting python 2.7 to read scripts containing utf-8 strings; setting the default encoding to utf-8 in sitecustomize.py doesn’t seem to take.
Here’s my sitecustomize.py:
import sys
sys.setdefaultencoding("utf-8")
I can verify that the default encoding has been changed from the command line:
$ /usr/bin/python -c 'import sys; print(sys.getdefaultencoding())'
utf-8
However, when I try to run a script containing a utf-8 string, as in test.py below (containing · at code point U+00b7)…
filename = 'utf-8·filename.txt'
print(filename)
…the default encoding seems to be ignored:
$ /usr/bin/python test.py
File "test.py", line 1
SyntaxError: Non-ASCII character '\xc2' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Using an encoding declaration, as in test-coding.py below…
# coding=utf-8
filename = 'utf-8·filename.txt'
print(filename)
…does work:
$ /usr/bin/python test-coding.py
utf-8·filename.txt
Unfortunately, the problem’s come up with scripts that are generated and run by another program (the catkin build system’s catkin_make). I can’t manually add encoding declarations to these scripts before catkin_make runs them, giving SyntaxError & check PEP 263. Changing the default encoding seems like the only solution short of going deep under catkin’s hood, or eliminating all non-ascii paths on my system… and setting it in sitecustomize.py should work, but doesn’t.
Any ideas or insights greatly appreciated!

sys.setdefaultencoding("utf-8") is not doing what you think it is doing. It has no effect on how Python parses source files. That's why you are still seeing SyntaxErrors when the source files use non-ascii characters. To eliminate those errors you need to add an encoding declaration at the beginning of the source file, such as
# -*- encoding: utf-8 -*-
Regarding sys.setdefaultencoding:
Do not try to change the default encoding. The default encoding is used when Python does silent conversion between str
and unicode. For example,
Expected Python2 behavior:
In [1]: '€' + u'€'
should raise UnicodeDecodeError because Python tries to convert '€' to unicode by
computing '€'.decode(sys.getdefaultencoding())
If you change the default encoding, you get different behavior:
In [2]: import sys; reload(sys); sys.setdefaultencoding('utf-8')
<module 'sys' (built-in)>
In [3]: '€' + u'€'
u'\u20ac\xe2\x82\xac'
If you change the defaultencoding, then your Python's behavior will be different than just about all other people's expectation of how Python2 should behave.

You cannot set the default encoding for source files. That default is hardcoded, as part of the language specification.
Set the PEP 263 header instead, as the interpreter is instructing you to do. You'll have to fix the Catkin build system, or rewrite the files it produces to include the header. Simply add a first or second line to those files with # coding=utf8, a task easily accomplished with Python.
The system default encoding is only used for implicit encoding and decoding of Unicode and byte string objects in running code. You should not try and change that, as other often relies on the value to not change. The ability to set it was removed entirely from Python 3.

Python, Unicode, and the Windows console

When I try to print a Unicode string in a Windows console, I get an error .
UnicodeEncodeError: 'charmap' codec can't encode character ....
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Edit: I'm using Python 2.5.
Note: #LasseV.Karlsen answer with the checkmark is sort of outdated (from 2008). Please use the solutions/answers/suggestions below with care!!
#JFSebastian answer is more relevant as of today (6 Jan 2016).

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.
I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.
The error means that Unicode characters that you are trying to print can't be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:
>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to
I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?
Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in #Daira Hopwood's answer. It can be called transparently i.e., you don't need to and should not modify your scripts if you use win-unicode-console package:
T:\> py -m pip install win-unicode-console
T:\> py -m run your_script.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
Is there any way I can make Python
automatically print a ? instead of failing in this situation?
If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:
T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]
In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.

Update: On Python 3.6 or later, printing Unicode strings to the console on Windows just works.
So, upgrade to recent Python and you're done. At this point I recommend using 2to3 to update your code to Python 3.x if needed, and just dropping support for Python 2.x. Note that there has been no security support for any version of Python before 3.7 (including Python 2.7) since December 2021.
If you really still need to support earlier versions of Python (including Python 2.7), you can use https://github.com/Drekin/win-unicode-console , which is based on, and uses the same APIs as the code in the answer that was previously linked here. (That link does include some information on Windows font configuration but I doubt it still applies to Windows 8 or later.)
Note: despite other plausible-sounding answers that suggest changing the code page to 65001, that did not work prior to Python 3.8. (It does kind-of work since then, but as pointed out above, you don't need to do so for Python 3.6+ anyway.) Also, changing the default encoding using sys.setdefaultencoding is (still) not a good idea.

If you're not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):
from __future__ import print_function
import sys
def safeprint(s):
try:
print(s)
except UnicodeEncodeError:
if sys.version_info >= (3,):
print(s.encode('utf8').decode(sys.stdout.encoding))
else:
print(s.encode('utf8'))
safeprint(u"\N{EM DASH}")
The bad character(s) in the string will be converted in a representation which is printable by the Windows console.

The below code will make Python output to console as UTF-8 even on Windows.
The console will display the characters well on Windows 7 but on Windows XP it will not display them well, but at least it will work and most important you will have a consistent output from your script on all platforms. You'll be able to redirect the output to a file.
Below code was tested with Python 2.6 on Windows.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
if sys.platform == 'win32':
try:
import win32console
except:
print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
exit(-1)
# win32console implementation of SetConsoleCP does not return a value
# CP_UTF8 = 65001
win32console.SetConsoleCP(65001)
if (win32console.GetConsoleCP() != 65001):
raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
win32console.SetConsoleOutputCP(65001)
if (win32console.GetConsoleOutputCP() != 65001):
raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")
#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

Just enter this code in command line before executing python script:
chcp 65001 & set PYTHONIOENCODING=utf-8

Like Giampaolo Rodolà's answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles,
For the moment I just wanted sthg which would mean my program would NOT CRASH, and which I understood ... and also which didn't involve importing too many exotic modules (in particular I'm using Jython, so half the time a Python module turns out not in fact to be available).
def pr(s):
try:
print(s)
except UnicodeEncodeError:
for c in s:
try:
print( c, end='')
except UnicodeEncodeError:
print( '?', end='')
NB "pr" is shorter to type than "print" (and quite a bit shorter to type than "safeprint")...!

Kind of related on the answer by J. F. Sebastian, but more direct.
If you are having this problem when printing to the console/terminal, then do this:
>set PYTHONIOENCODING=UTF-8

For Python 2 try:
print unicode(string, 'unicode-escape')
For Python 3 try:
import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)
Or try win-unicode-console:
pip install win-unicode-console
py -mrun your_script.py

TL;DR:
print(yourstring.encode('ascii','replace').decode('ascii'))
I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)
I wanted to parse chat messages in order to respond...
msg = s.recv(1024).decode("utf-8")
but also print them safely to the console in a human-readable format:
print(msg.encode('ascii','replace').decode('ascii'))
This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.

Python 3.6 windows7: There is several way to launch a python you could use the python console (which has a python logo on it) or the windows console (it's written cmd.exe on it).
I could not print utf8 characters in the windows console. Printing utf-8 characters throw me this error:
OSError: [winError 87] The paraneter is incorrect
Exception ignored in: (_io-TextIOwrapper name='(stdout)' mode='w' ' encoding='utf8')
OSError: [WinError 87] The parameter is incorrect
After trying and failing to understand the answer above I discovered it was only a setting problem. Right click on the top of the cmd console windows, on the tab font chose lucida console.

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:
import sys
sys.getdefaultencoding()
if it says ascii, there's your cause ;-)
You have to create a file called sitecustomize.py and put it under python path (I put it under /usr/lib/python2.5/site-packages, but that is differen on Win - it is c:\python\lib\site-packages or something), with the following contents:
import sys
sys.setdefaultencoding('utf-8')
and perhaps you might want to specify the encoding in your files as well:
# -*- coding: UTF-8 -*-
import sys,time
Edit: more info can be found in excellent the Dive into Python book

Nowadays, the Windows console does not encounter this error, unless you redirect the output.
Here is an example Python script scratch_1.py:
s = "∞"
print(s)
If you run the script as follows, everything works as intended:
python scratch_1.py
∞
However, if you run the following, then you get the same error as in the question:
python scratch_1.py > temp.txt
Traceback (most recent call last):
File "C:\Users\Wok\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\scratch_1.py", line 3, in <module>
print(s)
File "C:\Users\Wok\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 0: character maps to <undefined>
To solve this issue with the suggestion present in the original question, i.e. by replacing the erroneous characters with question marks ?, one can proceed as follows:
s = "∞"
try:
print(s)
except UnicodeEncodeError:
output_str = s.encode("ascii", errors="replace").decode("ascii")
print(output_str)
It is important:
to call decode(), so that the type of the output is str instead of bytes,
with the same encoding, here "ascii", to avoid the creation of mojibake.

James Sulak asked,
Is there any way I can make Python automatically print a ? instead of failing in this situation?
Other solutions recommend we attempt to modify the Windows environment or replace Python's print() function. The answer below comes closer to fulfilling Sulak's request.
Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:
In place of:
print(text)
substitute:
print(str(text).encode('utf-8'))
Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:
Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir
Instead of
Halmalo n’était plus qu’un point noir
Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.
Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.

The issue is with windows default encoding being set to cp1252, and need to be set to utf-8. (check PEP)
Check default encoding using:
import locale
locale.getpreferredencoding()
You can override locale settings
import os
if os.name == "nt":
import _locale
_locale._gdl_bak = _locale._getdefaultlocale
_locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))
referenced code from stack link

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

UnicodeEncodeError if piping output to wc -l - python

Related

pprint: UnicodeEncodeError: 'ascii' codec can't encode character

Python3 utf-8 decode issue

Where does Python get the preferred encoding from? [duplicate]

python 2.7 ignores default encoding set in sitecustomize.py when parsing scripts

Python, Unicode, and the Windows console

Categories

Resources