Encoding difference in Eclipse and Windows console - python

I have a Python script which works perfectly in Eclipse Console (Run configuration).
When I try to launch this script in a Windows 7 console, I get this encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 0: ordinal not in range(128)
I changed the code page of my Windows console to use the same one as in Eclipse (Window->Preferences->General->Workspace->Text file encoding):
chcp 1252
At the beginning of the script, I add:
# -*- coding: cp1252 -*-
But it changes nothing.
It works in the Eclipse console, so I do not want to decode/encode all my strings just for the Windows console.
Do you have any idea or advice on how to fix this behaviour?

You could try setting both Eclipse's and the Windows command line's encodings to UTF-8 and see if that works, unless you absolutely need the cp1252 encoding.

The issue is that Python expects your 8-bit strings to contain ASCII only, not Unicode. u'\xc9' is a Unicode character. Perhaps the Eclipse console is more forgiving than the Windows 7 console. You should use the unicode function to convert byte strings to Unicode as you receive them:
value = unicode(value, "utf-8")
See this article for more.
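To make the bytes/text distinction concrete, here is a minimal sketch in Python 3 syntax (the question uses Python 2, where unicode(value, "utf-8") plays the role of str(value, "utf-8")). The sample bytes and the helper name try_encode are illustrative assumptions, not taken from the asker's script.

```python
# Python 3 sketch; in Python 2 the decode step would be unicode(raw, "utf-8").
raw = b"\xc3\x89"            # UTF-8 bytes for the character U+00C9
text = str(raw, "utf-8")     # decode bytes to text as soon as they arrive


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot represent value."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(text, "ascii")     # ASCII has no U+00C9
cp1252_result = try_encode(text, "cp1252")   # cp1252 maps it to one byte
```

The point of the sketch is that the failure depends only on the codec chosen for output, not on the source file's coding declaration.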

Related

Redirecting python output to a file causes UnicodeEncodeError on Windows

I'm trying to redirect output of python script to a file. When output contains non-ascii characters it works on macOS and Linux, but not on Windows.
I've reduced the problem to a simple test. The following is what is shown in the Windows command prompt window. The test is only one print call.
Microsoft Windows [Version 10.0.17134.472]
(c) 2018 Microsoft Corporation. All rights reserved.
D:\>set PY
PYTHONIOENCODING=utf-8
D:\>type pipetest.py
print('\u0422\u0435\u0441\u0442')
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
D:\>type test.txt
Тест
D:\>type test.txt | iconv -f utf-8 -t utf-8
Тест
D:\>set PYTHONIOENCODING=
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
Traceback (most recent call last):
File "pipetest.py", line 1, in <module>
print('\u0422\u0435\u0441\u0442')
File "C:\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
D:\>python -V
Python 3.7.2
As one can see, setting the PYTHONIOENCODING environment variable helps, but I don't understand why it needs to be set. When the output is a terminal it works, but when the output is a file it fails. Why is cp1252 used when stdout is not a console?
Maybe it is a bug and can be fixed in Windows version of python?
According to the Python documentation, Python on Windows uses a different character encoding for the console device (utf-8) than for non-character devices such as disk files and pipes (the system locale). PYTHONIOENCODING can be used to override it.
https://docs.python.org/3/library/sys.html#sys.stdout
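As a rough way to reproduce the override from a script, the sketch below runs the one-line test program as a child process with its stdout piped and PYTHONIOENCODING set. The helper name run_child is hypothetical; the child program is the print call from the question.

```python
import os
import subprocess
import sys

# Child program from the question; the escapes keep the command line ASCII-only.
CHILD_CODE = r"print('\u0422\u0435\u0441\u0442')"


def run_child(io_encoding):
    """Run the child with stdout piped and PYTHONIOENCODING set as given."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
    )
    return result.returncode, result.stdout


returncode, output = run_child("utf-8")
decoded = output.decode("utf-8").strip()
```

Because stdout is a pipe here, the child is in the same situation as python pipetest.py > test.txt, so the environment variable is what decides the encoding.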
Another method is to change the encoding directly in the program. I tried it and it works fine.
sys.stdout.reconfigure(encoding='utf-8')
https://docs.python.org/3/library/io.html#io.TextIOWrapper.reconfigure
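The effect of reconfigure can be sketched without touching the real stdout. The example below uses an in-memory io.TextIOWrapper as a stand-in for a redirected sys.stdout on a cp1252 system; the function name is an assumption made for illustration.

```python
import io


def write_after_reconfigure(text):
    """Write text to a cp1252 stream after switching it to UTF-8."""
    raw = io.BytesIO()
    # Stand-in for sys.stdout when it is redirected on a cp1252 system.
    stream = io.TextIOWrapper(raw, encoding="cp1252")
    stream.reconfigure(encoding="utf-8")   # available since Python 3.7
    stream.write(text)
    stream.flush()
    return raw.getvalue()


data = write_after_reconfigure("\u0422\u0435\u0441\u0442")
```

Without the reconfigure call, the write would fail with the same 'charmap' error shown in the traceback, since cp1252 has no Cyrillic letters.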
Python has to write binary data (not strings) to stdout, hence the need for an encoding.
Encoding (used to convert strings into bytes) is determined differently for each platform:
on Linux and macOS it comes from current locale;
on Windows what is used is "Current language for non-Unicode programs" (codepage set in command line window is irrelevant).
(Thanks to @Eric Leung for the precise link)
The follow-up question would be why Python on Windows uses the system locale for non-Unicode programs rather than the code page set by the chcp command, but I will leave that for someone else.
It is also worth mentioning that there is a checkbox titled "Beta: Use Unicode UTF-8..." in Region Settings on Windows 10 (to open it: Win+R, type intl.cpl). With the checkbox ticked, the above example works without error. But the checkbox is off by default and buried deep in system settings.
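To see which of these settings applies on a given machine, a small diagnostic along the following lines may help. It is only a sketch; the function name and dictionary keys are made up for this example.

```python
import locale
import sys


def describe_output_encoding():
    """Collect the settings that decide how print() encodes text."""
    stdout = sys.stdout
    return {
        "stdout_is_terminal": bool(getattr(stdout, "isatty", lambda: False)()),
        "stdout_encoding": getattr(stdout, "encoding", None),
        "locale_encoding": locale.getpreferredencoding(False),
    }


info = describe_output_encoding()
for name, value in info.items():
    print(name, "=", value)
```

Running it once directly and once with output redirected to a file shows whether the two cases pick different encodings.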

UnicodeDecodeError when ssh from OS X

My Django app loads some files on startup (or when I execute management command). When I ssh from one of my Arch or Ubuntu machines all works fine, I am able to successfully run any commands and migrations.
But when I ssh from OS X (I have El Capitan) and try to do the same things, I get this error:
UnicodeDecodeError: 'ASCII' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
To open my files I use with open(path_to_file) as f: ...
The error happens when sshing from both iTerm and Terminal. I found out that the cause was the LC_CTYPE environment variable. It wasn't set on my Linux machines, but on the Mac it was UTF-8, so after I ssh to the server it was set to the same value there. The error went away after I unset LC_CTYPE.
So the actual question is: what happened, and how can I avoid it in the future? I can unset this variable on my local machine, but will that have any negative effects? And what is the best way of doing this?
Your terminal at your local machine uses a character encoding. The encoding it uses appears to be UTF-8. When you log on to your server (BTW, what OS does it run?) the programs that run there need to know what encoding your terminal supports so that they display stuff as needed. They get this information from LC_CTYPE. ssh correctly sets it to UTF-8, because that's what your terminal supports.
When you unset LC_CTYPE, then your programs use the default, ASCII. The programs now display in ASCII instead of UTF-8, which works because UTF-8 is backward compatible with ASCII. However, if a program needs to display a special character that does not exist in ASCII, it won't work.
Although from the information you give it's not entirely clear to me why the system behaves in this way, I can tell you that unsetting LC_CTYPE is a bad workaround. To avoid problems in the future, it would be better to make sure that all your terminals in all your machines use UTF-8, and get rid of ASCII.
When you open a file without specifying an encoding, Python uses the locale's (i.e. LC_CTYPE's) character set. I've never quite understood why it's made this way; why should the character set of your terminal indicate the encoding a file has? However, that's the way it's made, and the way to fix the problem correctly is to use the encoding parameter of open if you are using Python 3, or the codecs standard library module if you are using Python 2.
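A minimal Python 3 sketch of the encoding parameter follows. The temporary file and its contents are hypothetical stand-ins for the files the Django app loads; the helper name read_text is also an assumption.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file, decoding it with an explicitly chosen encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Hypothetical sample file; U+00E9 is stored as the bytes 0xC3 0xA9 in UTF-8.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

try:
    content = read_text(sample_path, "utf-8")   # independent of LC_CTYPE
    try:
        read_text(sample_path, "ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True                     # same kind of failure as in the question
finally:
    os.remove(sample_path)
```

Passing the encoding explicitly makes the result the same regardless of which machine the ssh session comes from.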
I had a similar issue after updating OS X: when ssh-ing to a UNIX server, the copyright character was not encoded because the UTF-8 locale was not properly set up. I solved the issue by unchecking the setting "Set locale environment variables on startup" in the preferences of my terminal(s).

Python 3.5.2 Non-Ascii Character Output

I'm running Python 3.5.2 and am trying to do some basic stuff with unicode and UTF-8. I'm currently just trying to output non-ASCII characters and am unable to do so. For example, this:
ddd = '\u0144'
print(ddd)
gives me a Unicode encode error, telling me that the character maps to undefined. From what I understand of unicode in Python 3.5.2, mapping should happen automatically. I tried putting # -*- coding: utf-8 -*- before the code and various combinations of .decode and .encode as well, but to no avail.
PM 2Ring, typing in chcp 65001 in command prompt did the trick. Thanks!
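Code page 65001 is the Windows name for UTF-8, which is why the switch helps. The sketch below checks which codecs can represent the character at all; the list of code pages is only an example, since the asker's original code page is not stated.

```python
def can_encode(text, encoding):
    """Report whether every character of text exists in the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


ddd = "\u0144"
# Code pages listed here are examples; the asker's original one is not stated.
for codec in ("cp437", "cp1252", "utf-8"):
    print(codec, can_encode(ddd, codec))
```

A coding declaration at the top of the file does not change this, because it only affects how the source file is read, not how output is encoded.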

Unsuppress UnicodeEncodeError exceptions when run from Aptana Studio PyDev

The following is a statement that should raise a UnicodeEncodeError exception:
print 'str+{}'.format(u'unicode:\u2019')
In a Python shell, the exception is raised as expected:
>>> print 'str+{}'.format(u'unicode:\u2019')
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
print 'str+{}'.format(u'unicode:\u2019')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
However, if I place that line at the start of my settings.py and start the Django server from Aptana Studio, no error is raised and this line is printed:
str+unicode:’
But if I execute manage.py runserver from a shell, the exception is raised:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
Is there some sort of Python setting that silently suppresses these unicode errors?
How can I prevent the unicode error from being ignored when I start the Django test server directly from Aptana Studio?
Using
Python 2.7.3
Aptana Studio 3.3.2
If you simply cast a bytestring to unicode, like
print unicode(s)
or mix unicode and bytestrings in string formatting operations like your example, Python will fall back on the system default encoding (which is ascii unless it has been changed), and implicitly will try to encode unicode / decode the bytestring using the ascii codec.
The currently active system default encoding can be displayed with
import sys
sys.getdefaultencoding()
Now it seems that Aptana Studio does in fact mess with your interpreter's default encoding:
From a blog post by Mikko Ohtamaa:
[...] Looks like the culprint was PyDev (Eclipse Python plug-in). The
interfering source code is
here.
Looks like the reason was to co-operate with Eclipse console. However
it has been done incorrectly. Instead of setting the console encoding,
the encoding is set to whole Python run-time environment, messing up
the target run-time where the development is being done.
There is a possible fix for this problem. In Eclipse Run… dialog settings you can choose Console Encoding on Common tab. There
is a possible value US-ASCII. I am not sure what Python 2 thinks
“US-ASCII” encoding name, since the default is “ascii”.
So make sure you reset the default to ascii, and you should be good.
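For readers on Python 3, where this implicit conversion no longer exists, the ascii step that Python 2 performs behind the scenes can be reproduced explicitly. This is only an illustration of the mechanism, not the asker's Python 2 code; the helper name is hypothetical.

```python
import sys


def encode_error_details(value, encoding):
    """Return (codec name, position) of the first unencodable character, or None."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc.encoding, exc.start
    return None


# The unicode argument from the question, encoded the way Python 2 does implicitly.
details = encode_error_details("unicode:\u2019", "ascii")

# Python 3 reports utf-8 here; Python 2 reports ascii unless something changed it.
default_encoding = sys.getdefaultencoding()
```

The reported position matches the "position 8" in the traceback, which confirms that the failing step is the encoding of the unicode argument rather than the print call itself.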

Windows CMD line errors in python for foreign language

I am downloading data from a MySQL database. Some of the data is in Korean. When I try to print the string before putting it in a table (Qt), the windows command prompt returns:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
However, when I use IDLE to run the code, it prints the Korean text fine. This caused me a lot of headaches when trying to debug why my program was not working, as I just click the Python file in the folder to run it. Finally, when using IDLE, it turned out everything works.
Is there something wrong with my Python installation, my Windows installation, or the Python code that just tries to print the characters? I assumed it wouldn't be the Python code, as it works in IDLE. Also, using a special function to print on Windows seems bad, as it limits the code's portability to other OSes (or will every OS have this problem?)
IDLE is based on tkinter, which is based on tcl/tk, which supports the entire Basic Multilingual Plane (BMP). (But tcl/tk does not support supplementary planes with other characters.) On Windows, the Python interactive interpreter runs in the same console window used by Command Prompt. This only supports code page subsets of the BMP, sometimes only 256 of 2^16 characters.
The codepage that supports ASCII and Korean is 949. (Easy Google search.) In Command Prompt, chcp 949 should change to that codepage. If you then start Python, you should be able to display Korean characters.
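If changing the code page is not an option, one portable way to keep printing from crashing is an error handler on the output stream. The sketch below uses an in-memory stream as a stand-in for a cp437 console; the helper name and the sample word are assumptions for illustration.

```python
import io


def make_tolerant_stream(raw, encoding):
    """Wrap a byte stream so unencodable characters become \\uXXXX escapes."""
    return io.TextIOWrapper(raw, encoding=encoding, errors="backslashreplace")


raw = io.BytesIO()                       # stand-in for a cp437 console
stream = make_tolerant_stream(raw, "cp437")
stream.write("\ud55c\uad6d\uc5b4")       # a three-syllable Korean word
stream.flush()
escaped = raw.getvalue()

# With a codec that covers Hangul, the same text encodes without escapes.
korean_bytes = "\ud55c\uad6d\uc5b4".encode("cp949")
```

The escaped output is less readable than real Hangul, but the program keeps running on any console, which addresses the portability concern in the question.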
