I'm trying to understand how PYTHONIOENCODING environment variable fits with Python2.7, so I tried the following things with the interactive prompt:
antox#antox-pc ~/Scrivania $ export PYTHONIOENCODING='latin1'
antox#antox-pc ~/Scrivania $ /usr/bin/python2.7
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'latin1'
>>> sys.stdout.encoding
'latin1'
>>> b = 'ÿ'
>>> b
'\xc3\xbf' #Shouldn't I get something like '\xff' because I set PYTHONIOENCODING to latin1? It looks as if utf-8 is been used instead
>>> print '\xff'
� # Why this odd character? Shouldn't I get 'ÿ' always for the reason above?
My questions/doubts are indicated as comments.
By setting PYTHONIOENCODING in the environment, you're telling Python to not trust your terminal/OS's information regarding the encoding -- you're saying that you know better, and the terminal device actually accepts that encoding, not whatever the OS &c will tell Python.
So in this case you're saying that (whatever it claims otherwise) your terminal actually accepts and properly formats bytes in latin-1.
That is probably not the case (if you don't set that environment variable what does sys.stdout.encoding say? utf-8, I guess?) so it's not surprising that you don't get the display you want:-).
On your specific question,
sys.getdefaultencoding()
tells you what encoding Python will use to translate between actual text (that is, Unicode) and byte strings, in situations where it has no other indication (I/O to stdin/stdout is not one of those situations, as it uses the encoding attribute of those files).
>>> b = 'ÿ'
This has nothing to do with sys.stdin/stdout -- rather, your terminal is sending, after the open quote, some "escape sequence" that boils down to proper utf-8 (my Mac's Terminal app does, for example). If this was in a .py file without a proper source-encoding preamble, it would be a syntax error -- the interactive interpreter has become a softy in 2.7.9:-)
>>> print '\xff'
� # Why this odd character? Shouldn't I get 'ÿ' always for the reason above?
You've told Python that your terminal accepts and properly displays latin-1 byte sequences (even though the terminal probably wants utf-8 ones and tells Python that, you've told Python to ignore what the terminal says about its encoding, or rather, what the OS says the terminal says:-).
So the byte of value 255 is sent as-is, and the terminal doesn't like it one bit (since the terminal doesn't actually accept latin-1!) and displays an error-marker.
Here's a typical example on my Mac (where the Terminal does actually accept 'utf-8'):
ozone:~ alex$ PYTHONIOENCODING=latin-1 python -c "print u'\xff'"
?
ozone:~ alex$ PYTHONIOENCODING=utf-8 python -c "print u'\xff'"
ÿ
ozone:~ alex$ python -c "print u'\xff'"
ÿ
Letting Python properly detect the terminal encoding on its own, or forcing it to what happens to be the right one, displays correctly.
Forcing the encoding to one the terminal does not in fact accept, unsurprisingly, does not display correctly.
Should you ever attach to your machine's serial port an ancient teletype which does in fact accept latin-1 (but the OS doesn't detect that fact properly), PYTHONIOENCODING will help you properly do Python I/O on that ancient teletype. Otherwise, it's unlikely that said environment setting will be of much use to you:-).
Related
I'm trying to redirect output of python script to a file. When output contains non-ascii characters it works on macOS and Linux, but not on Windows.
I've deduced the problem to a simple test. The following is what is shown in Windows command prompt window. The test is only one print call.
Microsoft Windows [Version 10.0.17134.472]
(c) 2018 Microsoft Corporation. All rights reserved.
D:\>set PY
PYTHONIOENCODING=utf-8
D:\>type pipetest.py
print('\u0422\u0435\u0441\u0442')
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
D:\>type test.txt
Тест
D:\>type test.txt | iconv -f utf-8 -t utf-8
Тест
D:\>set PYTHONIOENCODING=
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
Traceback (most recent call last):
File "pipetest.py", line 1, in <module>
print('\u0422\u0435\u0441\u0442')
File "C:\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
D:\>python -V
Python 3.7.2
As one can see setting PYTHONIOENCODING environment variable helps but I don't understand why it needed to be set. When output is terminal it works but if output is a file it fails. Why does cp1252 is used when stdout is not a console?
Maybe it is a bug and can be fixed in Windows version of python?
Based on Python documentation, Windows version use different character encoding on console device (utr-8) and non-character devices such as disk files and pipes (system locale). PYTHONIOENCODING can be used to override it.
https://docs.python.org/3/library/sys.html#sys.stdout
Another method is change the encoding directly in the program, I tried and it works fine.
sys.stdout.reconfigure(encoding='utf-8')
https://docs.python.org/3/library/io.html#io.TextIOWrapper.reconfigure
Python needs to write binary data to stdout (not strings) hence requirement for encoding parameter.
Encoding (used to convert strings into bytes) is determined differently for each platform:
on Linux and macOS it comes from current locale;
on Windows what is used is "Current language for non-Unicode programs" (codepage set in command line window is irrelevant).
(Thanks to #Eric Leung for precise link)
The follow up question would be why Python on Windows uses current system locale for non-Unicode programs, and not what is set by chcp command, but I will leave it for someone else.
Also it needs to be mentioned there's a checkbox titled "Beta: Use Unicode UTF-8..." in Region Settings on Windows 10 (to open - Win+R, type intl.cpl). By checking the checkbox the above example works without error. But this checkbox is off by default and really deep in system settings.
I'm running a recent Linux system where all my locales are UTF-8:
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
...
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
Now I want to write UTF-8 encoded content to the console.
Right now Python uses UTF-8 for the FS encoding but sticks to ASCII for the default encoding :-(
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'UTF-8'
I thought the best (clean) way to do this was setting the PYTHONIOENCODING environment variable. But it seems that Python ignores it. At least on my system I keep getting ascii as default encoding, even after setting the envvar.
# tried this in ~/.bashrc and ~/.profile (also sourced them)
# and on the commandline before running python
export PYTHONIOENCODING=UTF-8
If I do the following at the start of a script, it works though:
>>> import sys
>>> reload(sys) # to enable `setdefaultencoding` again
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("UTF-8")
>>> sys.getdefaultencoding()
'UTF-8'
But that approach seems unclean. So, what's a good way to accomplish this?
Workaround
Instead of changing the default encoding - which is not a good idea (see mesilliac's answer) - I just wrap sys.stdout with a StreamWriter like this:
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
See this gist for a small utility function, that handles it.
It seems accomplishing this is not recommended.
Fedora suggested using the system locale as the default,
but apparently this breaks other things.
Here's a quote from the mailing-list discussion:
The only supported default encodings in Python are:
Python 2.x: ASCII
Python 3.x: UTF-8
If you change these, you are on your own and strange things will
start to happen. The default encoding does not only affect
the translation between Python and the outside world, but also
all internal conversions between 8-bit strings and Unicode.
Hacks like what's happening in the pango module (setting the
default encoding to 'utf-8' by reloading the site module in
order to get the sys.setdefaultencoding() API back) are just
downright wrong and will cause serious problems since Unicode
objects cache their default encoded representation.
Please don't enable the use of a locale based default encoding.
If all you want to achieve is getting the encodings of
stdout and stdin correctly setup for pipes, you should
instead change the .encoding attribute of those (only).
--
Marc-Andre Lemburg
eGenix.com
This is how I do it:
#!/usr/bin/python2.7 -S
import sys
sys.setdefaultencoding("utf-8")
import site
Note the -S in the bangline. That tells Python to not automatically import the site module. The site module is what sets the default encoding and the removes the method so it can't be set again. But will honor what is already set.
How to print UTF-8 encoded text to the console in Python < 3?
print u"some unicode text \N{EURO SIGN}"
print b"some utf-8 encoded bytestring \xe2\x82\xac".decode('utf-8')
i.e., if you have a Unicode string then print it directly. If you have
a bytestring then convert it to Unicode first.
Your locale settings (LANG, LC_CTYPE) indicate a utf-8 locale and
therefore (in theory) you could print a utf-8 bytestring directly and it
should be displayed correctly in your terminal (if terminal settings
are consistent with the locale settings and they should be) but you
should avoid it: do not hardcode the character encoding of your
environment inside your script; print Unicode directly instead.
There are many wrong assumptions in your question.
You do not need to set PYTHONIOENCODING with your locale settings,
to print Unicode to the terminal. utf-8 locale supports all Unicode characters i.e., it works as is.
You do not need the workaround sys.stdout =
codecs.getwriter(locale.getpreferredencoding())(sys.stdout). It may
break if some code (that you do not control) does need to print bytes
and/or it may break while
printing Unicode to Windows console (wrong codepage, can't print undecodable characters). Correct locale settings and/or PYTHONIOENCODING envvar are enough. Also, if you need to replace sys.stdout then use io.TextIOWrapper() instead of codecs module like win-unicode-console package does.
sys.getdefaultencoding() is unrelated to your locale settings and to
PYTHONIOENCODING. Your assumption that setting PYTHONIOENCODING
should change sys.getdefaultencoding() is incorrect. You should
check sys.stdout.encoding instead.
sys.getdefaultencoding() is not used when you print to the
console. It may be used as a fallback on Python 2 if stdout is
redirected to a file/pipe unless PYTHOHIOENCODING is set:
$ python2 -c'import sys; print(sys.stdout.encoding)'
UTF-8
$ python2 -c'import sys; print(sys.stdout.encoding)' | cat
None
$ PYTHONIOENCODING=utf8 python2 -c'import sys; print(sys.stdout.encoding)' | cat
utf8
Do not call sys.setdefaultencoding("UTF-8"); it may corrupt your
data silently and/or break 3rd-party modules that do not expect
it. Remember sys.getdefaultencoding() is used to convert bytestrings
(str) to/from unicode in Python 2 implicitly e.g., "a" + u"b". See also,
the quote in #mesilliac's answer.
If the program does not display the appropriate characters on the screen,
i.e., invalid symbol,
run the program with the following command line:
PYTHONIOENCODING=utf8 python3 yourprogram.py
Or the following, if your program is a globally installed module:
PYTHONIOENCODING=utf8 yourprogram
On some platforms as Cygwin (mintty.exe terminal) with Anaconda Python (or Python 3), simply run export PYTHONIOENCODING=utf8 and
later run the program does not work,
and you are required to always do every time PYTHONIOENCODING=utf8 yourprogram to run the program correctly.
On Linux, in case of sudo, you can try to do pass the -E argument to export the user variables to the sudo process:
export PYTHONIOENCODING=utf8
sudo -E python yourprogram.py
If you try this and it did no work, you will need to enter on a sudo shell:
sudo /bin/bash
PYTHONIOENCODING=utf8 yourprogram
Related:
How to print UTF-8 encoded text to the console in Python < 3?
Changing default encoding of Python?
Forcing UTF-8 over cp1252 (Python3)
Permanently set Python path for Anaconda within Cygwin
https://superuser.com/questions/1374339/what-does-the-e-in-sudo-e-do
Why bash -c 'var=5 printf "$var"' does not print 5?
https://unix.stackexchange.com/questions/296838/whats-the-difference-between-eval-and-exec
While realizing the OP question is for Linux: when ending up here through a search engine, on Windows 10 the following fixes the issue:
set PYTHONIOENCODING=utf8
python myscript.py
My Django app loads some files on startup (or when I execute management command). When I ssh from one of my Arch or Ubuntu machines all works fine, I am able to successfully run any commands and migrations.
But when I ssh from OS X (I have El Capital) and try to do same things I get this error:
UnicodeDecodeError: 'ASCII' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
To open my files I use with open(path_to_file) as f: ...
The error happens when sshing from both iterm and terminal. I found out that reason was LC_CTYPE environment variable. It wasn't set on my other Linux machines but on mac it was UTF-8 so after I ssh to the server it was set the same. The error was fixed after I unset LC_CTYPE.
So the actual question is what has happened and how to avoid this further? I can unset this variable in my local machine but will it take some negative effects? And what is the best way of doing this?
Your terminal at your local machine uses a character encoding. The encoding it uses appears to be UTF-8. When you log on to your server (BTW, what OS does it run?) the programs that run there need to know what encoding your terminal supports so that they display stuff as needed. They get this information from LC_CTYPE. ssh correctly sets it to UTF-8, because that's what your terminal supports.
When you unset LC_CTYPE, then your programs use the default, ASCII. The programs now display in ASCII instead of UTF-8, which works because UTF-8 is backward compatible with ASCII. However, if a program needs to display a special character that does not exist in ASCII, it won't work.
Although from the information you give it's not entirely clear to me why the system behaves in this way, I can tell you that unsetting LC_CTYPE is a bad workaround. To avoid problems in the future, it would be better to make sure that all your terminals in all your machines use UTF-8, and get rid of ASCII.
When you try to open a file, Python uses the terminal's (i.e. LC_CTYPE's) character set. I've never quite understood why it's made this way; why should the character set of your terminal indicate the encoding a file has? However, that's the way it's made and the way to fix the problem correctly is to use the encoding parameter of open if you are using Python 3, or the codecs standard library module if you are using Python 2.
I had a similar issue after updating my OS-X, ssh-ing to a UNIX server the copyright character was not encoded cause the UTF-8 locale was not properly set up. I solved the issue unchecking the setting "Set locale environment variables on startup" in the preferences of my terminal(s).
Ok, i want to print a string in my windows xp console.
There are several characters the console cant print, so i have to encode to my stdout.encoding which is 'cp437'. but printing the encoded string, the 'ß' is printed as '\xe1'. after decoding back to unicode and printing the string, i get the output i want. but this feels somewhat wrong. how is the correct way to print a string and get ? for non-printable characters?
>>>var
'Bla \u2013 großes'
>>>print(var)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
>>>var.encode('cp437', 'replace')
b'Bla ? gro\xe1es'
>>>print(var.encode('cp437', 'replace'))
b'Bla ? gro\xe1es'
>>>var.encode('cp437', 'replace').decode('cp437')
'Bla ? großes'
>>>print(var.encode('cp437', 'replace').decode('cp437'))
Bla ? großes
edit:
#Mark Ransom: since i print a lot this makes the code pretty bloated i feel :/
#eryksun: excactly what i was looking for. thanks a lot!
To print Unicode characters that can't be represented using the console codepage, you could use win-unicode-console Python package that uses Unicode API such as ReadConsoleW/WriteConsoleW() to read/write Unicode from/to Windows console directly:
#!/usr/bin/env python3
import win_unicode_console
win_unicode_console.enable()
try:
print('Bla \u2013 großes')
finally:
win_unicode_console.disable()
save it to test_unicode.py file, and run it:
C:\> py test_unicode.py
You should see:
Bla – großes
As a preferred alternative, you could use run module (included in the package), to run an ordinary script with enabled Unicode support in Windows console:
C:\> py -m run unmodified_script_that_prints_unicode.py
To install win_unicode_console module, run:
C:\> pip install win-unicode-console
Make sure to select a font able to display Unicode characters in Windows console.
To save the output of a Python script to a file, you could use PYTHONIOENCODING envvar:
C:\> set PYTHONIOENCODING=utf-8:backslashreplace
C:\> py unmodified_script_that_prints_unicode.py >output_utf8.txt
Do not hardcode the character encoding of your environment inside your script, print Unicode instead. The examples show that the same script may be used to print to the console and to a file using different encodings and different methods.
An alternate solution is to not use the crippled Windows console for general unicode output. Tk text widgets (accessed as tkinter Text instances) handle all BMP chars as long as the selected font will.
Since Idle used tkinter, it can as well. Running an Idle editor file (call it tem.py) containing
print('Bla \u2013 großes')
prints the following in the Shell window.
Bla – großes
A file can be run through Idle from the console with -m and -r.
C:\>python -m idlelib -r c:/programs/python34/tem.py
This opens a shell window and prints the same as above. Or you can create your own tk window with Label or Text widget.
Running Python in a standard GNU terminal emulator on Ubuntu 14.04, I get the expected behavior when typing interactively:
>>> len('tiθ')
4
>>> len(u'tiθ')
3
The same thing happens when running an explicitly utf8-encoded script in Spyder:
# -*- coding: utf-8 -*-
print(len('tiθ'))
print(len(u'tiθ'))
...gives the following output, regardless of whether I run it in a new dedicated interpreter, or run in a Spyder-default interpreter (shown here):
>>> runfile('/home/dan/Desktop/mwe.py', wdir=r'/home/dan/Desktop')
4
3
But when typing interactively in a Python console within Spyder:
>>> len('tiθ')
4
>>> len(u'tiθ')
4
This issue has been brought up elsewhere, but that question regards differences between Windows and Linux. Here, I'm getting different results in different consoles on the same system, and the Python startup message in the terminal emulator and in the console within Spyder are identical:
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
What is going on here, and how can I get Python-within-Spyder to behave like Python-in-the-shell with regard to unicode strings? #martijn-pieters makes the comment on this question that
Spyder does all sorts of things to break normal a Python environment.
But I'm hoping there's a way to un-break this particular feature, since it makes it really hard to debug scripts in the IDE when I can't rely on my interactive typed commands to yield the same results as scripts run as a whole with their coding: utf-8 declaration.
UPDATES
In the GNU terminal:
>>> repr(u'tiθ')
"u'ti\\u03b8'"
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> sys.getdefaultencoding()
'ascii'
In Spyder console:
>>> repr(u'tiθ')
"u'ti\\xce\\xb8'"
>>> import sys
>>> sys.stdin.encoding # returns None
>>> sys.getdefaultencoding()
'UTF-8'
So knowing that, can I convince Spyder to behave like the GNU terminal?
After a bit of research, it seems that the strange behavior is at least in part built into Python (see this discussion thread; briefly, Python sets sys.stdin.encoding to None as default, and only changes it if it detects that the host is tty and it can detect the tty's encoding).
That said, a hackish workaround was just to tell Spyder to use /usr/bin/python3 as its executable instead of the default (which was Python 2.7.6). When running a Python 3 console within Spyder or within a GNU terminal emulator, I get results different (better!) than before, but importantly the results are consistent, regardless of whether running a script or typing interactively, and regardless of using GNU terminal or Spyder console:
>>> len('tiθ')
3
>>> len(u'tiθ')
3
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> sys.getdefaultencoding()
'utf-8'
This leads to other problems within Spyder, however: its sitecustomize.py script is not Python 3 friendly, for example, so every new interpreter starts with
Error in sitecustomize; set PYTHONVERBOSE for traceback:
SyntaxError: invalid syntax (sitecustomize.py, line 432)
The interpreter seems to work okay anyway, but it makes me nervous enough that I'm not considering this an "acceptable" answer, so if anyone else has a better idea...