The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu-based Docker container results in:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything I have to install to enable UTF-8 decoding?
It seems Ubuntu, depending on the version, uses one encoding or another as the default, and it may differ between the shell and Python as well. Adapted from this posting and also this blog:
Thus the recommended way seems to be to tell your python instance to use utf-8 as default encoding:
Set the encoding Python uses for its standard streams (stdin/stdout/stderr) via an environment variable:
export PYTHONIOENCODING=utf8
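A quick way to verify the variable took effect is to check the stream encodings from inside Python (a sketch; the reported values depend on your environment):
import sys
# With PYTHONIOENCODING=utf8 exported, these report the requested codec:
print(sys.stdout.encoding)
print(sys.stderr.encoding)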
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of the environment settings (see this question + answer, the Python docs and PEP 263):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of the encoding of files read by Python, you can specify it explicitly in the open() call:
with open(fname, "rt", encoding="utf-8") as f:
    ...
and there's a more hackish way with some side effects, but it saves you from specifying the encoding explicitly each time (Python 2 only):
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.
The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec used for encoding (e.g. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can represent all Unicode characters).
In your case, the second condition isn't met for the Linux Docker container you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
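To see which codecs are actually in play, a quick check using only the standard library (a sketch; the output will differ between your Windows machine and the container):
import sys
import locale
print(sys.stdout.encoding)                 # what print() will encode to
print(locale.getpreferredencoding(False))  # the locale-derived default
In an ASCII-only container the first line typically reports something like ANSI_X3.4-1968.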
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script for the shell your terminal runs, e.g. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match that of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output and the logging module for progress info; depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module). See the sketch after this list.
There might be other options, but I doubt that there are nicer ones.
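As a rough illustration of the last option (a sketch; out.txt and progress.log are placeholder names, not from the question):
import logging
# Write data with an explicit encoding instead of print()ing it:
with open("out.txt", "w", encoding="utf-8") as f:
    f.write("é\n")
# Route progress info through logging; FileHandler accepts an encoding too:
handler = logging.FileHandler("progress.log", encoding="utf-8")
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("wrote é to out.txt")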
Related
This is driving me crazy. I'm trying to pprint a dict with an é char in it, and it throws an error.
I'm using Python 3:
from pprint import pprint
knights = {'gallahad': 'the pure', 'robin': 'the bravé'}
pprint(knights)
Error:
File "/data/prod_envs/pythons/python36/lib/python3.6/pprint.py", line 176, in _format
stream.write(rep)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 43: ordinal not in range(128)
I read up on the Python ASCII docs, but there does not seem to be a quick way to solve this, other than taking the dict apart, rewriting the offending value to an ASCII-safe value via .encode, and then re-assembling the dict.
Is there any way I can get this to print without taking the dict apart?
This is unrelated to pprint: the module only formats the string into another string and then passes the formatted string to the underlying stream. So your error occurs when the é character (U+00E9) is written to stdout.
Now it really depends on the underlying OS and the configuration of the Python interpreter. In Linux or other Unix-like systems, you could try to declare a UTF-8 or Latin1 charset in your terminal session by setting the environment variable PYTHONIOENCODING before starting Python:
$ export PYTHONIOENCODING=Latin1
$ python
(or use PYTHONIOENCODING=utf8 depending on the actual encoding of your terminal or terminal window).
Standard input and output are file objects in Python. The Python 3 documentation says that, when these objects are created, if encoding is left unspecified then locale.getpreferredencoding(False) is called to fetch the locale's preferred encoding.
Your system should have been set up with one or more "locales" when GNU/Linux was installed (I'm guessing from your paths that you are using some version of GNU/Linux). On a "sensible" setup, the default locale should allow UTF-8. But if you only did a "minimal" installation (for example as part of setting up a container), or something like that, then it is possible that the system has set locale to "C" (the ultimate fallback locale), which does not support UTF-8.
Just because your terminal can accept UTF-8 (as demonstrated by using echo with a UTF-8 string), does not mean Python knows that UTF-8 is acceptable. If Python sees the locale set to "C" then it will assume only ASCII is allowed unless told otherwise.
You can check the current locale by typing locale at the shell prompt, and change it by setting the LC_ALL environment variable. But before changing it you must check with locale -a to see which locales are available on your system, otherwise your change may not be effective and you may get the "C" locale anyway. If your system has not been set up with the locale you want, you can add it if you have root access: most GNU/Linux distributions provide options to do this when you (re)configure a package called locales, so for example on Debian/Ubuntu-based distros, sudo dpkg-reconfigure locales should show you the options.
But sometimes you will be in the awkward position of having to write a Python script to run on a system that has not been set up with decent locales and there's nothing you can do about it because you don't have root and the sysadmin insists on giving you the absolute minimum. Then what do we do?
Well there are options within Python itself. You could run export PYTHONIOENCODING=utf-8 before running Python, to tell Python to use that encoding no matter what the locale says. Or you could give pprint a stream= parameter, set to a stream that you've opened yourself using open() with an encoding="utf-8" parameter (although this is no good if you want to use sys.stdout or os.popen instead of a file). Or you could upgrade to Python 3.7 and use sys.stdout.reconfigure(encoding='utf-8') (but this won't work in the Python 3.6 mentioned in the original question).
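For example, on 3.7+ the reconfigure route is a one-liner (a minimal sketch, reusing the dict from the question):
import sys
from pprint import pprint
sys.stdout.reconfigure(encoding='utf-8')  # Python 3.7+ only
pprint({'gallahad': 'the pure', 'robin': 'the bravé'})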
Or, you could import codecs and do w=codecs.getwriter("utf-8")(sys.stdout.buffer) and then pass stream=w to your pprint:
from pprint import pprint
import sys, codecs
w = codecs.getwriter("utf-8")(sys.stdout.buffer)
d = {"testing": "这是个考验"}
pprint(d, stream=w)
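The same writer object works anywhere a writable text stream is accepted, e.g. as the file= argument to print(), should you need it beyond pprint.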
I'm trying to find a generic solution to print unicode strings from a python script.
The requirements are that it must run in both python 2.7 and 3.x, on any platform, and with any terminal settings and environment variables (e.g. LANG=C or LANG=en_US.UTF-8).
The Python print function automatically encodes to the terminal encoding when printing, but if that encoding is ASCII, it fails for any non-ASCII character.
For example, the following works when the environment has "LANG=en_US.UTF-8":
x = u'\xea'
print(x)
But it fails in python 2.7 when "LANG=C":
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 0: ordinal not in range(128)
The following works regardless of the LANG setting, but would not properly show unicode characters if the terminal was using a different unicode encoding:
print(x.encode('utf-8'))
The desired behavior would be to always show unicode in the terminal if it is possible and show some encoding if the terminal does not support unicode. For example, the output would be UTF-8 encoded if the terminal only supported ascii. Basically, the goal is to do the same thing as the python print function when it works, but in the cases where the print function fails, use some default encoding.
You can handle the LANG=C case by telling sys.stdout to default to UTF-8 in cases when it would otherwise default to ASCII.
import sys, codecs

if sys.stdout.encoding is None or sys.stdout.encoding == 'ANSI_X3.4-1968':
    # ANSI_X3.4-1968 is the name glibc reports for plain ASCII.
    utf8_writer = codecs.getwriter('UTF-8')
    if sys.version_info.major < 3:
        sys.stdout = utf8_writer(sys.stdout, errors='replace')
    else:
        sys.stdout = utf8_writer(sys.stdout.buffer, errors='replace')

print(u'\N{snowman}')
The above snippet fulfills your requirements: it works in Python 2.7 and 3.4, and it doesn't break when LANG is in a non-UTF-8 setting such as C.
It is not a new technique, but it's surprisingly hard to find in the documentation. As presented above, it actually respects non-UTF-8 settings such as ISO 8859-*. It only defaults to UTF-8 if Python would have bogusly defaulted to ASCII, breaking the application.
I don't think you should try and solve this at the Python level. Document your application requirements, log the locale of systems you run on so it can be included in bug reports and leave it at that.
If you do want to go this route, at least distinguish between terminals and pipes. You should never output data to a terminal that the terminal cannot explicitly handle; don't blindly output UTF-8, for example, as codepoints above U+007F could end up being interpreted as control codes when encoded.
For a pipe, output UTF-8 by default and make it configurable.
So you'd detect if a TTY is being used, then handle encoding based on that; for a terminal, set an error handler (pick one of replace or backslashreplace to provide replacement characters or escape sequences for whatever characters cannot be handled). For a pipe, use a configurable codec.
import codecs
import os
import sys

if os.isatty(sys.stdout.fileno()):
    # Interactive terminal: stick to what the terminal says it can handle.
    output_encoding = sys.stdout.encoding
    errors = 'replace'
else:
    # Pipe or file: default to UTF-8.
    output_encoding = 'utf-8'  # allow override from settings
    errors = None  # perhaps parse from settings, not needed for UTF8

sys.stdout = codecs.getwriter(output_encoding)(sys.stdout, errors=errors)
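This way interactive use degrades gracefully to replacement characters, while piped output stays lossless UTF-8 for downstream tools.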
You can encode the string yourself with the error handler 'backslashreplace', so that unrepresentable characters are converted to escape sequences. In Python 2 you can print the result of encode directly, but in Python 3 you need to decode it back to Unicode first.
import sys

s = u'caf\xe9'  # example value; any unicode string works here
encoding = sys.stdout.encoding
print(s.encode(encoding, 'backslashreplace').decode(encoding))
If sys.stdout.encoding doesn't deliver the value that your terminal can handle, that's a separate problem that you must deal with.
You can handle the exception:
def always_print(s):
    try:
        print(s)
    except UnicodeEncodeError:
        # Fallback: on Python 2 this writes raw UTF-8 bytes to the terminal;
        # on Python 3 it prints the bytes literal, e.g. b'caf\xc3\xa9'.
        print(s.encode('utf-8'))
I’m having problems getting python 2.7 to read scripts containing utf-8 strings; setting the default encoding to utf-8 in sitecustomize.py doesn’t seem to take.
Here’s my sitecustomize.py:
import sys
sys.setdefaultencoding("utf-8")
I can verify that the default encoding has been changed from the command line:
$ /usr/bin/python -c 'import sys; print(sys.getdefaultencoding())'
utf-8
However, when I try to run a script containing a utf-8 string, as in test.py below (containing · at code point U+00b7)…
filename = 'utf-8·filename.txt'
print(filename)
…the default encoding seems to be ignored:
$ /usr/bin/python test.py
File "test.py", line 1
SyntaxError: Non-ASCII character '\xc2' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Using an encoding declaration, as in test-coding.py below…
# coding=utf-8
filename = 'utf-8·filename.txt'
print(filename)
…does work:
$ /usr/bin/python test-coding.py
utf-8·filename.txt
Unfortunately, the problem comes up with scripts that are generated and run by another program (the catkin build system's catkin_make). I can't manually add encoding declarations to these scripts before catkin_make runs them, so they fail with the SyntaxError above and its pointer to PEP 263. Changing the default encoding seems like the only solution short of going deep under catkin's hood, or eliminating all non-ASCII paths on my system... and setting it in sitecustomize.py should work, but doesn't.
Any ideas or insights greatly appreciated!
sys.setdefaultencoding("utf-8") is not doing what you think it is doing. It has no effect on how Python parses source files. That's why you are still seeing SyntaxErrors when the source files use non-ascii characters. To eliminate those errors you need to add an encoding declaration at the beginning of the source file, such as
# -*- coding: utf-8 -*-
Regarding sys.setdefaultencoding:
Do not try to change the default encoding. The default encoding is used when Python does silent conversion between str and unicode. For example, the expected Python 2 behavior of
In [1]: '€' + u'€'
is to raise a UnicodeDecodeError, because Python tries to convert '€' to unicode by computing '€'.decode(sys.getdefaultencoding()).
If you change the default encoding, you get different behavior:
In [2]: import sys; reload(sys); sys.setdefaultencoding('utf-8')
<module 'sys' (built-in)>
In [3]: '€' + u'€'
u'\u20ac\xe2\x82\xac'
If you change the default encoding, your Python's behavior will differ from just about everyone else's expectations of how Python 2 should behave.
You cannot set the default encoding for source files. That default is hardcoded, as part of the language specification.
Set the PEP 263 header instead, as the interpreter is instructing you to do. You'll have to fix the Catkin build system, or rewrite the files it produces to include the header. Simply add a first or second line to those files with # coding=utf8, a task easily accomplished with Python (see the sketch below).
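A minimal sketch of that rewrite (the glob pattern is hypothetical; point it at wherever catkin_make puts its generated scripts):
import glob

HEADER = '# -*- coding: utf-8 -*-\n'

for path in glob.glob('build/*.py'):  # hypothetical location of the generated files
    with open(path) as f:
        lines = f.readlines()
    if HEADER in lines[:2]:
        continue  # header already present
    # Insert after a shebang line if there is one, otherwise at the very top.
    pos = 1 if lines and lines[0].startswith('#!') else 0
    lines.insert(pos, HEADER)
    with open(path, 'w') as f:
        f.writelines(lines)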
The system default encoding is only used for implicit encoding and decoding of Unicode and byte string objects in running code. You should not try to change that, as other code often relies on that value not changing. The ability to set it was removed entirely from Python 3.
Some people use the following to declare the encoding method for the text of their Python source code:
# -*- coding: utf-8 -*-
Back in 2001, it was said that the default encoding the Python interpreter assumes is ASCII. I have dealt with strings using non-ASCII characters in my Python code without declaring an encoding for my code, and I don't remember bumping into an encoding error before. What is the default encoding for code assumed by the Python interpreter now?
I am not sure if this is relevant.
My OS is Ubuntu, and I am using the default Python interpreter, and gedit or emacs for editing.
Will the default encoding assumed by the Python interpreter change if the above changes?
Thanks.
Without any explicit encoding declaration, the assumed encoding for your source code will be
ascii for Python 2.x
utf-8 for Python 3.x
See PEP 0263 and Using source code encoding for Python 2.x, and PEP 3120 for the new default of utf-8 for Python 3.x.
So the default encoding assumed for source code depends directly on the version of the Python interpreter, and it is not configurable.
Note that the source code encoding is something entirely different than dealing with non-ASCII characters as part of your data in strings.
There are two distinct cases where you may encounter non-ASCII characters:
As part of your programs data, during runtime
As part of your source code (and since Python 2 doesn't allow non-ASCII characters in identifiers, that usually means hard-coded string data in your source code or comments).
The source code encoding declaration affects what encoding your source code will be interpreted with - so it's only needed if you decide to directly put non-ASCII characters in your source code.
So, the following code will eventually have to deal with the fact that there might be non-ASCII characters in data.txt:
with open('data.txt') as f:
    for line in f:
        # do something with `line`
But it doesn't contain any non-ASCII characters in the source code, therefore it doesn't need an encoding declaration at the top of the file. It will however need to properly decode line if it wants to turn it into unicode. Simply doing unicode(line) will use the system default encoding, which is ascii (a separate setting from the default source encoding, though it also happens to be ascii). So to explicitly decode the string using utf-8 you'd need to do line.decode('utf-8').
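For instance, assuming data.txt is UTF-8 encoded (a Python 2 sketch matching this answer; the source itself stays pure ASCII, so no declaration is needed):
with open('data.txt') as f:
    for line in f:
        text = line.decode('utf-8')  # explicit; unicode(line) would try ascii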
This code however does contain non-ASCII characters directly in its source code:
TEST_DATA = 'Bär' # <--- non-ASCII character on this line
print TEST_DATA
And it will fail with a SyntaxError similar to this, unless you declare an explicit source code encoding:
SyntaxError: Non-ASCII character '\xc3' in file foo.py on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
So assuming your text editor is configured to save files in utf-8, you'd need to put the line
# -*- coding: utf-8 -*-
at the top of the file for Python to interpret the source code correctly.
My advice, however, would be to generally avoid putting non-ASCII characters in your source code, precisely because whether it will be written and read correctly depends on your and your co-workers' editor and terminal settings.
Instead you can use escaped strings to safely enter non-ASCII characters in your code:
TEST_DATA = 'B\xc3\xa4r'  # the UTF-8 bytes for 'Bär'
By default, Python source files are treated as encoded in UTF-8. In that encoding, characters of most languages in the world can be used simultaneously in string literals, identifiers and comments, although the standard library only uses ASCII characters for identifiers, a convention that any portable code should follow. To display all these characters properly, the editor must recognize that the file is UTF-8, and it must use a font that supports all the characters in the file.
It is also possible to specify a different encoding for source files. In order to do that, put a special comment line at the top of the file:
# -*- coding: encoding -*-
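For example, to declare Windows-1252 you would write (here cp1252 replaces the encoding placeholder above):
# -*- coding: cp1252 -*-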
https://docs.python.org/dev/tutorial/interpreter.html
I have made some adaptations to the script from this answer, and I am having problems with Unicode. Some of the questions end up being written poorly.
Some answers and responses end up looking like:
Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2)
How can I make the ’ be translated to the right character?
Note: If that matters, I'm using python 2.6, on a French windows.
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
EDIT1: Based on Ryan Ginstrom's post, I have been able to correct a part of the output, but I am having problems with python's unicode.
In Idle / python shell:
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
In a text file, when redirecting stdout
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
How can I correct that?
Edit2: I have tried Jarret Hardie's solution but it didn't do anything.
I am on windows, using python 2.6, so my site-packages folder is at:
C:\Python26\Lib\site-packages
There was no siteconfig.py file, so I created one, pasted the code provided by Jarret Hardie, and started a Python interpreter, but it seems it has not been loaded:
>>> sys.getdefaultencoding()
'ascii'
I noticed there is a site.py file at :
C:\Python26\Lib\site.py
I tried changing the encoding in the function
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
to set the encoding to utf-8. It worked (after a restart of python of course).
>>> sys.getdefaultencoding()
'utf-8'
The sad thing is that it didn't correct the characters in my program. :(
You should be able to convert HTML/XML entities into Unicode characters. Check out this answer on SO:
Decoding HTML Entities With Python
Basically you want something like this:
import urllib2
from BeautifulSoup import BeautifulStoneSoup

# URL is the page you are fetching (placeholder, as in the original answer)
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
Does changing your default encoding in siteconfig.py work?
In your site-packages directory (on my OS X system it's /Library/Python/2.5/site-packages/), create a file called siteconfig.py. In this file put:
import sys
sys.setdefaultencoding('utf-8')
The setdefaultencoding method is removed from the sys module once siteconfig.py is processed, so you must put it in site-packages so that Python will read it when the interpreter starts up.