Windows CMD line errors in python for foreign language - python

I am downloading data from a MySQL database. Some of the data is in Korean. When I try to print the string before putting it in a table (Qt), the windows command prompt returns:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
However, when I use IDLE to run the code, it prints the Korean text fine. This caused me a lot of headaches when trying to debug why my program was not working, because I normally just double-click the Python file in its folder to run it. When I finally ran it from IDLE, everything worked.
Is there something wrong with my Python installation, my Windows installation, or the Python code that tries to print the characters? I assumed it wouldn't be the Python code, since it works in IDLE. Also, using a special function to print on Windows seems bad, as it limits the code's portability to other operating systems (or will every OS have this problem?)

IDLE is based on tkinter, which is based on Tcl/Tk, which supports the entire Basic Multilingual Plane (BMP), though not the supplementary planes that hold other characters. On Windows, the Python interactive interpreter runs in the same console window used by Command Prompt. That console only supports code-page subsets of the BMP, sometimes only 256 of the 2^16 characters.
The codepage that supports ASCII and Korean is 949. (Easy Google search.) In Command Prompt, chcp 949 should change to that codepage. If you then start Python, you should be able to display Korean characters.
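As a rough illustration of why the code page matters, the sketch below checks whether a string can be encoded in a given code page. It is written for Python 3 (the question uses Python 2.7), and the helper name can_encode is made up for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


korean = "\ud55c\uad6d\uc5b4"  # the Hangul syllables for "Korean language"

# cp437 is the code page named in the traceback; cp949 is the Korean one.
print(can_encode(korean, "cp437"))
print(can_encode(korean, "cp949"))
```

The same check against sys.stdout.encoding tells you whether a plain print is likely to fail in the current console.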

Related

Redirecting python output to a file causes UnicodeEncodeError on Windows

I'm trying to redirect output of python script to a file. When output contains non-ascii characters it works on macOS and Linux, but not on Windows.
I've deduced the problem to a simple test. The following is what is shown in Windows command prompt window. The test is only one print call.
Microsoft Windows [Version 10.0.17134.472]
(c) 2018 Microsoft Corporation. All rights reserved.
D:\>set PY
PYTHONIOENCODING=utf-8
D:\>type pipetest.py
print('\u0422\u0435\u0441\u0442')
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
D:\>type test.txt
Тест
D:\>type test.txt | iconv -f utf-8 -t utf-8
Тест
D:\>set PYTHONIOENCODING=
D:\>python pipetest.py
Тест
D:\>python pipetest.py > test.txt
Traceback (most recent call last):
  File "pipetest.py", line 1, in <module>
    print('\u0422\u0435\u0441\u0442')
  File "C:\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
D:\>python -V
Python 3.7.2
As one can see, setting the PYTHONIOENCODING environment variable helps, but I don't understand why it needs to be set. When the output goes to the terminal it works, but when the output goes to a file it fails. Why is cp1252 used when stdout is not a console?
Maybe it is a bug that can be fixed in the Windows version of Python?
According to the Python documentation, the Windows version uses different character encodings for the console device (utf-8) and for non-character devices such as disk files and pipes (the system locale). PYTHONIOENCODING can be used to override this.
https://docs.python.org/3/library/sys.html#sys.stdout
Another method is to change the encoding directly in the program. I tried it and it works fine.
sys.stdout.reconfigure(encoding='utf-8')
https://docs.python.org/3/library/io.html#io.TextIOWrapper.reconfigure
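The difference between the two encodings can be reproduced without redirecting anything. The sketch below wraps an in-memory byte buffer in io.TextIOWrapper, which is the same class that backs sys.stdout; the helper name write_as is hypothetical, and the buffer only stands in for a redirected file.

```python
import io


def write_as(text, encoding):
    """Encode text the way a text stream with this encoding would."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


text = "\u0422\u0435\u0441\u0442"  # the string printed by pipetest.py

# UTF-8 can represent the Cyrillic letters, so this succeeds.
print(write_as(text, "utf-8"))

# cp1252 has no Cyrillic letters, so the write raises UnicodeEncodeError.
try:
    write_as(text, "cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)
```

Calling sys.stdout.reconfigure(encoding='utf-8') switches the real stdout from the second case to the first.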
Python has to write binary data (not strings) to stdout, hence the need for an encoding.
Encoding (used to convert strings into bytes) is determined differently for each platform:
on Linux and macOS it comes from current locale;
on Windows what is used is "Current language for non-Unicode programs" (codepage set in command line window is irrelevant).
(Thanks to @Eric Leung for the precise link)
The follow-up question would be why Python on Windows uses the current system locale for non-Unicode programs rather than the code page set by the chcp command, but I will leave that for someone else.
Also it needs to be mentioned there's a checkbox titled "Beta: Use Unicode UTF-8..." in Region Settings on Windows 10 (to open - Win+R, type intl.cpl). By checking the checkbox the above example works without error. But this checkbox is off by default and really deep in system settings.

What is the equivalent of '\r\x1b[K' on Windows cmd.exe in Python?

According to https://jcastellssala.com/2012/07/20/python-command-line-waiting-feedback-and-some-background-on-why/, '\r\x1b[K' is an escape sequence that erases the current line in the console so that Python can write something new in its place. But when I tried to use the sequence in Windows cmd, it printed weird characters instead. In Python, is there an equivalent sequence or action for Windows cmd that lets me erase the last line I printed to the console?
Amazingly, support for ANSI escape sequences in the Windows console was only added in Windows 10 Version 1511:
http://www.nivot.org/blog/post/2016/02/04/Windows-10-TH2-(v1511)-Console-Host-Enhancements
They will not work in older versions of Windows, unless you use a terminal emulator which supports them, like ConEmu:
https://conemu.github.io/
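For consoles without ANSI support, one common workaround is to return to the start of the line, overwrite it with spaces, and return again. The sketch below is only an illustration of that idea: the names erase_line and show_progress, the ansi flag, and the default width of 79 columns are all assumptions, not part of any library.

```python
import sys


def erase_line(width=79, ansi=False):
    """Return a string that clears the current console line.

    With ansi=True this is the escape sequence from the question.
    Otherwise the line is blanked with spaces; width is an assumed
    console width minus one, so the cursor does not wrap.
    """
    if ansi:
        return "\r\x1b[K"
    return "\r" + " " * width + "\r"


def show_progress(messages, ansi=False, out=sys.stdout):
    """Write each message over the previous one on the same line."""
    for message in messages:
        out.write(erase_line(ansi=ansi) + message)
        out.flush()
    out.write("\n")


show_progress(["working 1/3", "working 2/3", "done"])
```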

UnicodeDecodeError when ssh from OS X

My Django app loads some files on startup (or when I execute a management command). When I ssh from one of my Arch or Ubuntu machines everything works fine, and I am able to run any commands and migrations successfully.
But when I ssh from OS X (I have El Capitan) and try to do the same things, I get this error:
UnicodeDecodeError: 'ASCII' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
To open my files I use with open(path_to_file) as f: ...
The error happens when sshing from both iTerm and Terminal. I found out that the cause was the LC_CTYPE environment variable. It wasn't set on my Linux machines, but on the Mac it was UTF-8, so after I ssh to the server it was set to the same value there. The error went away after I unset LC_CTYPE.
So the actual question is: what happened, and how do I avoid it in future? I can unset this variable on my local machine, but will that have any negative effects? And what is the best way of doing this?
Your terminal at your local machine uses a character encoding. The encoding it uses appears to be UTF-8. When you log on to your server (BTW, what OS does it run?) the programs that run there need to know what encoding your terminal supports so that they display stuff as needed. They get this information from LC_CTYPE. ssh correctly sets it to UTF-8, because that's what your terminal supports.
When you unset LC_CTYPE, then your programs use the default, ASCII. The programs now display in ASCII instead of UTF-8, which works because UTF-8 is backward compatible with ASCII. However, if a program needs to display a special character that does not exist in ASCII, it won't work.
Although from the information you give it's not entirely clear to me why the system behaves in this way, I can tell you that unsetting LC_CTYPE is a bad workaround. To avoid problems in the future, it would be better to make sure that all your terminals in all your machines use UTF-8, and get rid of ASCII.
When you try to open a file, Python uses the terminal's (i.e. LC_CTYPE's) character set. I've never quite understood why it's made this way; why should the character set of your terminal indicate the encoding a file has? However, that's the way it's made and the way to fix the problem correctly is to use the encoding parameter of open if you are using Python 3, or the codecs standard library module if you are using Python 2.
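A minimal sketch of the explicit-encoding fix in Python 3 follows. It assumes the file is UTF-8 (the 0xc3 byte in the error is consistent with that, but the question does not confirm it), and it uses a temporary file in place of the real path_to_file.

```python
import os
import tempfile

# Create a small UTF-8 file whose first byte is 0xc3, as in the error message.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("\u00e9t\u00e9\n".encode("utf-8"))

# Naming the encoding makes the result independent of LC_CTYPE.
with open(path, encoding="utf-8") as f:
    content = f.read()

# Forcing ASCII reproduces the failure seen when the locale default is ASCII.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

os.remove(path)
print(ascii(content), ascii_failed)
```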
I had a similar issue after updating OS X: when ssh-ing to a UNIX server, the copyright character was not encoded because the UTF-8 locale was not properly set up. I solved the issue by unchecking the setting "Set locale environment variables on startup" in the preferences of my terminal(s).

Python 3.4.3: Safe way to print unicode strings to a console? [duplicate]

OK, I want to print a string in my Windows XP console.
There are several characters the console can't print, so I have to encode to my stdout.encoding, which is 'cp437'. But when I print the encoded string, the 'ß' is printed as '\xe1'. After decoding back to unicode and printing the string, I get the output I want. But this feels somewhat wrong. What is the correct way to print a string and get ? for non-printable characters?
>>>var
'Bla \u2013 großes'
>>>print(var)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
>>>var.encode('cp437', 'replace')
b'Bla ? gro\xe1es'
>>>print(var.encode('cp437', 'replace'))
b'Bla ? gro\xe1es'
>>>var.encode('cp437', 'replace').decode('cp437')
'Bla ? großes'
>>>print(var.encode('cp437', 'replace').decode('cp437'))
Bla ? großes
edit:
@Mark Ransom: since I print a lot, this makes the code pretty bloated, I feel :/
@eryksun: exactly what I was looking for. Thanks a lot!
To print Unicode characters that can't be represented in the console code page, you could use the win-unicode-console Python package, which uses Unicode APIs such as ReadConsoleW()/WriteConsoleW() to read and write Unicode from and to the Windows console directly:
#!/usr/bin/env python3
import win_unicode_console
win_unicode_console.enable()
try:
    print('Bla \u2013 großes')
finally:
    win_unicode_console.disable()
Save it to a test_unicode.py file, and run it:
C:\> py test_unicode.py
You should see:
Bla – großes
As a preferred alternative, you could use the run module (included in the package) to run an ordinary script with Unicode support enabled in the Windows console:
C:\> py -m run unmodified_script_that_prints_unicode.py
To install the win_unicode_console module, run:
C:\> pip install win-unicode-console
Make sure to select a font able to display Unicode characters in Windows console.
To save the output of a Python script to a file, you could use the PYTHONIOENCODING environment variable:
C:\> set PYTHONIOENCODING=utf-8:backslashreplace
C:\> py unmodified_script_that_prints_unicode.py >output_utf8.txt
Do not hardcode the character encoding of your environment inside your script, print Unicode instead. The examples show that the same script may be used to print to the console and to a file using different encodings and different methods.
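To see what the error handlers named above actually do, here is a small sketch that round-trips a string through an encoding with a chosen handler. The helper name printable is made up for this illustration; it takes the encoding as a parameter rather than hardcoding one, and in a real script you might pass sys.stdout.encoding.

```python
def printable(text, encoding, errors="replace"):
    """Return text with characters the encoding lacks substituted.

    'replace' substitutes '?', 'backslashreplace' substitutes an escape
    such as \\u2013, so the result can always be encoded again.
    """
    return text.encode(encoding, errors).decode(encoding)


var = "Bla \u2013 gro\u00dfes"  # the string from the question

replaced = printable(var, "cp437")
escaped = printable(var, "cp437", "backslashreplace")
print(replaced)
print(escaped)
```

With cp437 the en dash is substituted while the 'ß' survives, because cp437 contains 'ß' but not U+2013.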
An alternative solution is not to use the crippled Windows console for general Unicode output. Tk text widgets (accessed as tkinter Text instances) handle all BMP characters as long as the selected font does.
Since IDLE uses tkinter, it can as well. Running an IDLE editor file (call it tem.py) containing
print('Bla \u2013 großes')
prints the following in the Shell window.
Bla – großes
A file can be run through IDLE from the console with -m and -r.
C:\>python -m idlelib -r c:/programs/python34/tem.py
This opens a shell window and prints the same as above. Or you can create your own tk window with Label or Text widget.

Encoding difference in Eclipse and Windows console

I have a Python script which works perfectly in Eclipse Console (Run configuration).
When I try to launch this script on a Windows 7 console, I have the encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 0: ordinal not in range(128)
I changed the code page of my Windows console to the same one Eclipse uses (Window->Preferences->General->Workspace->Text file encoding):
chcp 1252
At the beginning of the script, I add:
# -*- coding: cp1252 -*-
But it changes nothing.
It works on Eclipse console, so I do not want to decode/encode all my strings for Windows console.
Have you any idea or advice to fix that behaviour?
You could try setting both Eclipse's and the Windows command line's encodings to UTF-8 and see if that works, unless you absolutely need the cp1252 encoding.
The issue is that Python 2 expects your 8-bit strings to contain ASCII only, not Unicode, and u'\xc9' is a Unicode character. Perhaps the Eclipse console is more forgiving than the Windows 7 console. You should use the unicode function to convert byte strings to Unicode as you receive them:
value = unicode(value, "utf-8")
See this article for more.
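For comparison, here is a sketch of the same decode step in Python 3, where bytes.decode plays the role of Python 2's unicode(value, "utf-8"). The sample bytes are invented for the illustration, and it assumes the incoming data is UTF-8.

```python
# Bytes as they might arrive from a file or socket: UTF-8 for U+00C9 + "cole".
raw = b"\xc3\x89cole"

# Decode once, at the boundary, and work with text from then on.
value = raw.decode("utf-8")

# cp1252 has a slot for U+00C9, so encoding for that code page works.
as_cp1252 = value.encode("cp1252")

# ASCII does not, which is the failure reported in the question.
try:
    value.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii(value), as_cp1252, ascii_ok)
```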
