chcp 65001 codepage results in program termination without any error - python

Problem
The problem arises when I want to input a Unicode character in the Python interpreter (for simplicity I have used a-umlaut in the example, but I first encountered this with Farsi characters). Whenever I use Python with the chcp 65001 code page and then try to input even one Unicode character, Python exits without any error.
I have spent days trying to solve this problem, to no avail. But today I found a thread on the Python website, another on MySQL, and another on Lua-users in which this sudden exit was reported, though without any solution, with some saying that chcp 65001 is inherently broken.
It would be good to know once and for all whether this problem is inherent to chcp's design or whether there is a possible workaround.
Reproducing the Error
chcp 65001
Python 3.x (Python shell):
print('ä')
Result: it just exits the shell.
However, this works: python.exe -c "print('ä')"
And so does this: print('\u00e4')
Result: ä
In LuaJIT 2.0.4:
print('ä')
Result: it just exits the shell.
However, this works: print('\xc3\xa4')
I have come up with these observations so far:
Direct output via the command prompt works.
The Unicode-escape or hex-escape equivalent of the character works.
So
Is it correct to conclude that this is not a Python bug, and that we simply can't use a Unicode character directly in CLI programs in the Windows command prompt or any of its wrappers such as ConEmu or Cmder? (I am using Cmder to be able to see and type Unicode characters in the Windows shell, and I have done so without any problem.)

To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
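For example, enabling the package at the start of a script or interactive session looks like this (a minimal sketch using the package's documented enable() call):
import win_unicode_console
win_unicode_console.enable()   # replaces the stream objects with ReadConsoleW/WriteConsoleW based ones
print('ä')                     # now works in the console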
The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
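For example, a single non-ASCII character already breaks the one-byte-per-character assumption:
>>> 'ä'.encode('utf-8')  # 1 character, but 2 bytes in UTF-8
b'\xc3\xa4'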
You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.
Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.

Related

Python3 utf-8 decode issue

The following code runs fine with Python 3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu-based Docker container results in:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable UTF-8 decoding?
It seems Ubuntu, depending on the version, uses one encoding or another as its default, and the default may also differ between the shell and Python. Adapted from this posting and also this blog:
Thus the recommended way seems to be to tell your Python instance to use UTF-8 as the default encoding:
Set the default encoding of Python's standard streams via an environment variable:
export PYTHONIOENCODING=utf8
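You can verify from within Python that the setting took effect (a quick sanity check, assuming a POSIX shell):
python3 -c "import sys; print(sys.stdout.encoding)"   # should print utf8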
Also, in your source files you can state the encoding you prefer explicitly, so it should work irrespective of the environment settings (see this question and answer, the Python docs and PEP 263):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the encoding of files read by Python, you can specify it explicitly in the open() call:
with open(fname, "rt", encoding="utf-8") as f:
...
And there's a more hackish way with some side effects, but it saves you from specifying the encoding explicitly each time (Python 2 only; sys.setdefaultencoding() is not available in Python 3):
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.
The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "é" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Ubuntu-based Docker container you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
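You can confirm where the failure happens by separating the two steps yourself; the decode succeeds everywhere, and re-running the encode step by hand reproduces the error that print() raised under the ASCII locale:
data = b"\xc3\xa9"
text = data.decode('utf-8')   # fine on any platform: decoding never touches the terminal
text.encode('ascii')          # raises the same UnicodeEncodeError that print() triggered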
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on these (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script of the shell your terminal runs, eg. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8', closefd=False)  # closefd=False keeps the real stdout descriptor open
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output and the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module); see the sketch after this list.
There might be other options, but I doubt that there are nicer ones.
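As an illustration of the last option, here is a minimal logging sketch that writes progress to a UTF-8 file instead of the terminal (run.log is a stand-in name):
import logging

# Route progress messages to a UTF-8 encoded file, sidestepping the
# terminal's encoding entirely.
handler = logging.FileHandler('run.log', encoding='utf-8')
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info('é handled without touching the terminal')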

UnicodeEncodeError in Python on Windows Console

I'm getting the following error while recursing through the files in a directory and printing the file names to the console:
Traceback (most recent call last):
File "C:\Program Files\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 53: character maps to <undefined>
According to the error, one of the characters in the file name string is \u2013, which is an EN DASH (–), a different character from the commonly seen hyphen-minus (-).
I have checked my Windows console codepage, which is set to 437. Now, I see that I have two options to work around this: either change the encoding of the Windows console, or convert the characters I get from the file names to suit the console encoding. How would I do that in Python 3.3?
The Windows console is using the cp437 encoding, and '\u2013' is a character that isn't supported by that encoding. Try adding this to your code:
import io, sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, 'cp437', 'backslashreplace')
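With that in place, the unsupported character is escaped instead of killing the program:
print('\u2013')   # prints \u2013 rather than raising UnicodeEncodeError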
or convert the characters I get from the file names to suit the console encoding
Probably the console encoding is already correct (can't tell from the error message though). Code page 437 simply doesn't include that character so you won't be able to print it.
You can reopen stdout with a text encoder that has a fallback encoding, as demonstrated in iamsudip's answer which uses backslashreplace, to at least get readable (if not reliably recoverable) output instead of an error.
changing the encoding of the Windows console
You can do this by executing the console command chcp 1252 before running Python, but that will still only give you a different limited repertoire of printable characters - including U+2013, but not many other Unicode characters.
In theory you can chcp to 65001 to get UTF-8 which would allow you to print any character. Unfortunately there are serious bugs in the C runtime's standard IO implementation, which usually make this unusable in practice.
This sorry state of affairs affects all applications that use the MS C runtime's stdio library calls, including Python and most other languages, with the result that Unicode on the Windows console just doesn't work in most cases.
If you really have to get Unicode out to the Windows console you can use the Win32 WriteConsoleW API directly using ctypes, but it's not much fun.
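For the record, a minimal ctypes sketch of that approach; this is an illustration under the assumption that output goes to a real console (not a pipe or file), not a production-ready wrapper:
import ctypes

kernel32 = ctypes.windll.kernel32
STD_OUTPUT_HANDLE = -11  # the standard output device

def write_console(text):
    # WriteConsoleW takes UTF-16 text directly, so no codepage conversion is involved.
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = ctypes.c_ulong(0)
    kernel32.WriteConsoleW(handle, text, len(text), ctypes.byref(written), None)

write_console('\u2013 en dash and \u00e4 print fine\n')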

Unicode output in Python's stdout when running from cmd.exe [duplicate]

This question already has answers here:
Python, Unicode, and the Windows console
I am running Windows 7, and its console has been configured to use the Consolas font, which gives me the possibility of Unicode output. I have verified the console's ability to display Unicode many times with programs such as Far Manager: both Cyrillic and German äöü letters can be read on the same console, in the same string, without switching encodings.
Now about Python.
I am trying very hard, but I can't get Unicode into its output.
By default, print(sys.stdout.encoding) prints cp866, and stdout is unable to output any characters except ASCII and Cyrillic.
It gives me the following results:
print("Ля-ля äöüÄÖÜß")
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6-12: character maps to <undefined>
print("Ля-ля äöüÄÖÜß".encode("utf-8"))
b'\xd0\x9b\xd1\x8f-\xd0\xbb\xd1\x8f \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
OK, I've set the PYTHONIOENCODING environment variable in a batch file:
SET PYTHONIOENCODING=UTF-8
and got:
print(sys.stdout.encoding)
UTF-8
print("Ля-ля äöüÄÖÜß")
╨Ы╤П-╨╗╤П ├д├╢├╝├Д├Ц├Ь├Я
print("Ля-ля äöüÄÖÜß".encode("utf-8"))`
b'\xd0\x9b\xd1\x8f-\xd0\xbb\xd1\x8f \xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
What to do?
Actually, there's something of a bug in the interaction between Python and the Windows console (see http://bugs.python.org/issue1602). It is possible to read and write Unicode in the Windows console using the C functions ReadConsoleW and WriteConsoleW instead of ReadConsole and WriteConsole. So one seems-to-be-working solution is to write your own stdout and stdin objects that call ReadConsoleW and WriteConsoleW via ctypes. For output this works, but for input there's a problem: the Python interactive interpreter doesn't actually use sys.stdin for getting input (although calling the input() function works) – see http://bugs.python.org/issue17620.
Many people say there's a problem with the Windows console. But you can actually type Unicode characters (if you have the proper keyboard layout) with no problem, and they are displayed with no problem. You can even run a file called “∫.py” with some Unicode arguments; it runs correctly, and the arguments arrive correctly in the sys.argv strings.
Update: I have built a Python package to deal with these issues. See https://github.com/Drekin/win-unicode-console and https://pypi.python.org/pypi/win_unicode_console. Install it with pip install win_unicode_console. It works, at least for me, on Python 3.4, Python 3.5, and Python 2.7.

Writing binary data to stdout with IronPython

I have two Python scripts which I am running on Windows with IronPython 2.6 on .NET 2.0. One outputs binary data and the other processes the data. I was hoping to be able to stream data from the first to the second using pipes. The problem I encountered here is that, when run from the Windows command-line, sys.stdout uses CP437 character encoding and text mode instead of binary mode ('w' instead of 'wb'). This causes some bytes greater than 127 to be written as the wrong character (i.e., different byte values produce the same character in the output and are thus indistinguishable by the script reading them).
For example, this script prints the same character (an underscore) twice:
import sys
sys.stdout.write(chr(95))
sys.stdout.write(chr(222))
So when I try to read the data I get something different than what I originally wrote.
I wrote this script to check if the problem was writing in 'w' mode or the encoding:
import sys
str = chr(222)
# try writing chr(222) in ASCII in both write modes
# ASCII is the default encoding
open('ascii_w', 'w').write(str)
open('ascii_wb', 'wb').write(str)
# set encoding to CP437 and try writing chr(222) in both modes
reload(sys)
sys.setdefaultencoding("cp437")
open('cp437_w', 'w').write(str)
open('cp437_wb', 'wb').write(str)
After running that, the file cp437_w contains character 95 and the other three each contain character 222. Therefore, I believe that the problem is caused by the combination of CP437 encoding and writing in 'w' mode. In this case it would be solved if I could force stdout to use binary mode (I'm assuming that getting it to use ASCII encoding is impossible given that cmd.exe uses CP437). This is where I'm stuck; I can't find any way to do this.
Some potential solutions I found that didn't work:
running ipy -u doesn't seem to have any effect (I also tested to see if it would cause Unix-style newlines to be printed; it doesn't, so I suspect that -u doesn't work with IronPython at all)
I can't use this solution because msvcrt is not supported in IronPython
with Python 3.x you can access the underlying binary buffer through sys.stdout.buffer; this isn't available in 2.6
os.fdopen(sys.stdout.fileno(), 'wb', 0) just returns stdout in 'w' mode
So yeah, any ideas? Also, if there's a better way of streaming binary data that doesn't use stdout, I'm certainly open to suggestions.
sys.stdout is just a variable that points to the same object as sys.__stdout__.
Therefore, just open a file in binary mode, assign it to sys.stdout, and use it. If you ever need the real, normal stdout back again, you can restore it with
sys.stdout = sys.__stdout__
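A minimal sketch of that approach (Python 2 / IronPython style; data.bin is a hypothetical file name):
import sys

out = open('data.bin', 'wb')   # binary mode: bytes pass through unmodified
sys.stdout = out
sys.stdout.write(chr(95))      # writes byte 0x5F
sys.stdout.write(chr(222))     # writes byte 0xDE, no longer mangled by text mode
out.close()
sys.stdout = sys.__stdout__    # restore the original stdout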

What encoding do I need to display a GBP sign (pound sign) using python on cygwin in Windows XP?

I have a Python (2.5.4) script which I run in Cygwin (in a DOS box on Windows XP). I want to include a pound sign (£) in the output. If I do so, I get this error:
SyntaxError: Non-ASCII character '\xa3' in file dbscan.py on line 253, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
OK. So I looked at that PEP, and then tried adding this to the beginning of my script:
# coding=cp437
That stopped the error, but the output shows ú where it should show £.
I've tried ISO-8859-1 as well, with the same result.
Does anyone know which encoding I need?
Or where I could look to find out?
The Unicode code point for the pound sign is 163 (decimal), or A3 in hex, so the following should work regardless of the encoding of your script, as long as the output encoding works correctly.
print u"\xA3"
Try this encoding:
# -*- coding: utf-8 -*-
and then display the '£' sign with:
print unichr(163)
There are two encodings involved here:
The encoding of your source code, which must be correct in order for your input file to mean what you think it means
The encoding of the output, which must be correct in order for the symbols emitted to display as expected.
It seems your output encoding is currently wrong. If this is running in a terminal window in Cygwin, it is that terminal's encoding that you need to match.
EDIT: I just ran the following Python program in a (native) Windows XP terminal window, and thought the result was slightly interesting:
>>> ord("£")
156
156 is certainly not the code point for the pound sign in the Latin-1 encoding you tried. It doesn't seem to be in Windows codepage 1252 either, which I would expect my terminal to use ... Weird. (For what it's worth, 156, i.e. 0x9C, is the pound sign in the OEM codepages 437 and 850 that Windows consoles typically use, which would explain this result.)
