Writing binary data to stdout with IronPython - python

I have two Python scripts which I am running on Windows with IronPython 2.6 on .NET 2.0. One outputs binary data and the other processes the data. I was hoping to be able to stream data from the first to the second using pipes. The problem I encountered here is that, when run from the Windows command-line, sys.stdout uses CP437 character encoding and text mode instead of binary mode ('w' instead of 'wb'). This causes some bytes greater than 127 to be written as the wrong character (i.e., different byte values produce the same character in the output and are thus indistinguishable by the script reading them).
For example, this script prints the same character (an underscore) twice:
import sys
sys.stdout.write(chr(95))
sys.stdout.write(chr(222))
So when I try to read the data I get something different than what I originally wrote.
I wrote this script to check if the problem was writing in 'w' mode or the encoding:
import sys
str = chr(222)
# try writing chr(222) in ASCII in both write modes
# ASCII is the default encoding
open('ascii_w', 'w').write(str)
open('ascii_wb', 'wb').write(str)
# set encoding to CP437 and try writing chr(222) in both modes
reload(sys)
sys.setdefaultencoding("cp437")
open('cp437_w', 'w').write(str)
open('cp437_wb', 'wb').write(str)
After running that, the file cp437_w contains character 95 and the other three each contain character 222. Therefore, I believe that the problem is caused by the combination of CP437 encoding and writing in 'w' mode. In this case it would be solved if I could force stdout to use binary mode (I'm assuming that getting it to use ASCII encoding is impossible given that cmd.exe uses CP437). This is where I'm stuck; I can't find any way to do this.
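For reference, the same distinction can be reproduced in modern CPython 3 (a sketch only; the filenames are illustrative, and latin-1 is used because it maps code point 222 straight to byte 222):

```python
import os

# Bytes written in binary mode come back unchanged; text mode writes
# *characters* through a codec, so what lands on disk depends on the
# encoding in effect.
payload = bytes([222])

with open('check_wb.bin', 'wb') as f:
    f.write(payload)
with open('check_wb.bin', 'rb') as f:
    assert f.read() == b'\xde'

# latin-1 maps code point 222 to byte 222, so this round trip is
# lossless; a codec without that character would fail or substitute.
with open('check_w.txt', 'w', encoding='latin-1') as f:
    f.write(chr(222))
with open('check_w.txt', 'rb') as f:
    assert f.read() == b'\xde'

os.remove('check_wb.bin')
os.remove('check_w.txt')
```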
Some potential solutions I found that didn't work:
running ipy -u doesn't seem to have any effect (I also tested to see if it would cause Unix-style newlines to be printed; it doesn't, so I suspect that -u doesn't work with IronPython at all)
I can't use this solution because msvcrt is not supported in IronPython
with Python 3.x you can access binary stdout through sys.stdout.buffer; this isn't available in 2.6
os.fdopen(sys.stdout.fileno(), 'wb', 0) just returns stdout in 'w' mode
So yeah, any ideas? Also, if there's a better way of streaming binary data that doesn't use stdout, I'm certainly open to suggestions.

sys.stdout is just a variable that points to the same thing as sys.__stdout__
Therefore, just open up a file in binary mode, assign the file to sys.stdout and use it. If you ever need the real, normal stdout back again, you can get it with
sys.stdout = sys.__stdout__
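A minimal sketch of that suggestion (the file name is mine; in Python 3 the payload is bytes, while in IronPython 2.6 you would write str of the same byte values). Binary mode keeps bytes 95 and 222 distinct:

```python
import sys

# Point sys.stdout at a file opened in binary mode, write raw bytes
# through it, then restore the real stdout afterwards.
real_stdout = sys.stdout
out = open('binary_payload.bin', 'wb')
sys.stdout = out

sys.stdout.write(bytes([95]))   # 0x5f, '_'
sys.stdout.write(bytes([222]))  # 0xde, no longer collapses to '_'

sys.stdout = real_stdout        # same object as sys.__stdout__ here
out.close()

with open('binary_payload.bin', 'rb') as f:
    print(f.read())  # b'_\xde'
```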

Related

json.dump() uses ASCII codec encoding (instead of requested UTF-8) when redirecting stdout to a file

This tiny python program:
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json
import sys
x = { "name":u"This doesn't work β" }
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
print
Generates this output when run at a terminal:
$ ./tester.py
{"name": "This doesn't work β"}
Which is exactly as I would expect. However, if I redirect stdout to a file, it fails:
$ ./tester.py > output.json
Traceback (most recent call last):
  File "./tester.py", line 9, in <module>
    json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b2' in position 19: ordinal not in range(128)
However, a direct print (without json.dump) can be redirected to a file:
print u"This does work β".encode('utf-8')
It's as if the json package ignores the encoding option if stdout is not a terminal.
How can I get the json package to do what I want?
JSON is a text serialization format (that incidentally has a recommended binary encoding), not a binary serialization format. The json module itself only cares about encoding to the extent that it would like to know what Python 2's terrible str type is supposed to represent (is it ASCII bytes? UTF-8 bytes? latin-1 bytes?).
Since Python 2 text handling is, as stated, terrible, the json module is happy to return either str (when ensure_ascii is true, or the stars align in other cases and it's convinced you've told it str is compatible with your expected encoding, and none of the inputs are actually unicode) or unicode (when ensure_ascii is false, most of the time).
Like the rest of Python 2, sys.stdout is a bit wishy-washy. Even if it is set to encoding='ascii' by your locale settings, it ignores that when you write a str to it (sys.stdout.write('\xe9') should fail, but instead it treats the str as pre-encoded raw binary data and doesn't bother to verify it matches the expected encoding). But when unicode comes in, it doesn't have that option; unicode is text (not UTF-8 text, not ASCII text, etc.), from the ideal text world of unicorns and rainbows, and that world isn't expressed in tawdry bytes.
So sys.stdout must encode the result, and it does so with the locale determined encoding (sys.stdout.encoding will tell you what it is). When that's ASCII, and it receives something that can't encode to ASCII, it explodes (as it should).
The point is, the json module is always returning text (either unicode, or str that it's convinced is effectively text in the wishy-washy Python 2 world), and sometimes you get lucky and that text happens to be in a format that bypasses checks in sys.stdout.
But you shouldn't be relying on that. If your output must be in a specific encoding, use that encoding. The simplest way to do this (simplest in the sense that it pushes most work to the interpreter to do for you) is to not use sys.stdout (explicitly, or implicitly via print) and write your data to files you open with io.open (a backport of Python 3's open, that properly handles encodings), explicitly specifying encoding='utf-8'. If you must use sys.stdout, and you insist on ignoring the locale encoding, you can rewrap it, e.g.:
with io.open(sys.stdout.fileno(), encoding='utf-8', closefd=False) as encodedout:
    json.dump(x, encodedout, ensure_ascii=False, encoding="utf-8")
which temporarily wraps the stdout file descriptor in a modern file-like object (using closefd to avoid closing the underlying descriptor when it's closed).
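A quick way to sanity-check the rewrap trick without touching the real stdout is to aim it at a temporary file descriptor instead; the fileno/closefd mechanics are the same. (Note that on Python 3, json.dump no longer accepts an encoding argument, so it is dropped here.)

```python
import io
import json
import os
import tempfile

x = {"name": "works \u03b2"}

# Wrap a raw file descriptor, as the answer does with sys.stdout.fileno().
fd, path = tempfile.mkstemp()
with io.open(fd, 'w', encoding='utf-8', closefd=False) as encodedout:
    json.dump(x, encodedout, ensure_ascii=False)
os.close(fd)  # closefd=False left the descriptor to us

with io.open(path, encoding='utf-8') as f:
    print(f.read())  # {"name": "works β"}
os.unlink(path)
```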
TL;DR: Switch to Python 3. Python 2 is awful when it comes to non-ASCII text, and its modules are often even worse (json should absolutely be returning a consistent type, or at least just one type for each setting of ensure_ascii, not dynamically selecting based on the inputs and encoding; it's not even the worst offender, either; the csv module is absolutely awful). Also, it has reached end-of-life, and will not be patched for anything from here on out, so continuing to use it leaves you vulnerable to any security problems found between the beginning of this year and the end of time. Among other things, Python 3 uses str exclusively for text (with the full Unicode support of Py2's unicode type), and modern Python 3 (3.7+) will coerce ASCII locales to UTF-8 (because basically all systems can actually handle the latter), which should fix all your problems. Non-ASCII text will behave the same as ASCII text, and weirdo locales like yours that insist they're ASCII (and therefore won't handle non-ASCII output) will be "fixed" to work as you desire, without manually encoding and decoding, rewrapping file handles, etc.
Consolidating all the comments and answers into one final answer:
Note: this answer is for Python 2.7. Python 3 is likely to be different.
The json spec says that json files are utf-8 encoded. However, the Python json package does not like to take chances and so writes straight ascii and escapes unicode characters in the output.
You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str. In that case, encoding the unicode output is your problem.
There is no way to make the json package generate utf-8 or any other encoding on output. It's either ascii or unicode; take your pick.
The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
Here's what finally worked for me:
import codecs

ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)
tl;dr: the real mystery was why it didn't barf when just letting stdout go to the terminal. It turned out that stdout.write() was detecting when output was going to a terminal and encoding per the $LANG environment variable. When output goes to a file, the unicode is encoded to ascii instead, and an error results when a non-encodable character is encountered.
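The codecs.getwriter wrapper can be exercised against an in-memory buffer standing in for sys.stdout (on Python 3, json.dump's encoding argument no longer exists, so it is omitted):

```python
import codecs
import io
import json

# codecs.getwriter returns a StreamWriter class; its instances encode
# str to bytes on write, so json's unicode output lands as UTF-8 bytes.
buf = io.BytesIO()
ofile = codecs.getwriter('utf-8')(buf)

json.dump({"name": "This does work \u03b2"}, ofile, ensure_ascii=False)
print(buf.getvalue())  # b'{"name": "This does work \xce\xb2"}'
```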
There is an environment variable Python uses that can override encoding to the terminal or for redirection, so this should work without wrapping stdout inside the script.
$ export PYTHONIOENCODING=utf8
$ ./tester.py > output.json

Python Script Called in Powershell Fails to Write to Stdout when Piped to File

So I'm attempting to chain a couple scripts together, some in powershell (5.1), some in python (3.7).
The script that I am having trouble with is written in python, and writes to stdout via sys.stdout.write(). This script reads in a file, completes some processing, and then outputs the result.
When this script is called by itself, that is to say no output to any pipe, it properly executes and writes to the standard powershell console. However, as soon as I attempt to pipe the output in any fashion I start to get errors.
In particular, two files have the character \u200b, or a zero-width-space. Printing the output of these characters to the console is fine, but attempting to redirect the output to a file via a variety of methods:
py ./script.py input.txt > output.txt
py ./script.py input.txt | Set-Content -Encoding utf8 output.txt
Start-Process powershell -RedirectStandardOutput "output.txt" -Argumentlist "py", "./script.py", "input.txt"
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
all fail with:
  File "\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 61: character maps to <undefined>
On the python side, modifying the script to remove all non-UTF-8 characters also causes this script to fail, so I am a bit stuck. I am currently thinking that the issue is occurring due to how the piped output is causing python to set a different environment, though I am not sure how such modifications could be made within the python code.
For completeness' sake, here is the function writing the output (note: file_lines is a list of strings):
import sys

def write_lines(file_lines):
    for line in file_lines:
        line = list(map(lambda x: '"' + x + '"', line))
        line = "".join(entry + ',' for entry in line)
        if line is not None:
            sys.stdout.write(line + "\n")
The root cause is in the way Python handles stdout. Python does some low-level detection to get the system's encoding, then wraps the stream in an io.TextIOWrapper with the encoding set to whatever it detected; that wrapper is what you get in sys.stdout (stderr and stdin get the same treatment).
Now, this detection returns UTF-8 when running in the shell, because PowerShell works in UTF-8 and puts a layer of translation between the system and the running program. But when piping to another program, the communication is direct, without the PowerShell translation; this direct communication uses the system's encoding, which for Windows in Western locales is cp1252 (a.k.a. Windows-1252).
system <(cp1252)> posh <(utf-8)> python                               # here stdout returns to the shell
system <(cp1252)> posh <(utf-8)> python <(cp1252)> | pipe or > redirect  # here stdout goes directly to the next program
As for your issue: without looking at the rest of your program and the input file, my best guess is some encoding mismatch, most likely in the reading of the input file. By default Python 3 reads text files using the locale's preferred encoding (often cp1252 on Windows); if the file is in some other encoding you get problems: best case garbage text, worst case an encoding exception.
To solve it you need to know which encoding your input file was created with, which can get tricky, and detection is usually slow. The other option is to work with the files as bytes, but that may not be possible depending on the processing done.
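The specific failure is easy to reproduce off to the side: cp1252 simply has no slot for the zero-width space, while UTF-8 encodes it fine. As a sketch (not from the original answer), Python 3.7+ also lets you override the detected stdout encoding in-process:

```python
import sys

# cp1252 has no mapping for U+200B, which is exactly the charmap error
# the question hits when stdout is detected as cp1252.
try:
    '\u200b'.encode('cp1252')
    raise AssertionError('expected a UnicodeEncodeError')
except UnicodeEncodeError:
    pass

print('\u200b'.encode('utf-8'))  # b'\xe2\x80\x8b'

# Python 3.7+: force UTF-8 regardless of what the pipe detection chose.
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')
```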

Python3 utf-8 decode issue

The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu based docker container results in :
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable utf-8 decoding ?
It seems Ubuntu, depending on the version, uses one encoding or another as the default, and it may vary between the shell and Python as well. Adapted from this posting and also this blog:
Thus the recommended way seems to be to tell your python instance to use utf-8 as default encoding:
Set the default encoding for Python's standard streams via an environment variable:
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of environment settings (see this question + answer, the Python docs and PEP 263):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of encoding of files read by python, you can specify it explicitly in the open command
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way (Python 2 only; reload of sys and setdefaultencoding are not available like this in Python 3) with some side effects, but it saves you from specifying the encoding explicitly each time
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.
The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, it is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script of the shell your terminal runs, e.g. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.
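To see that decode() itself is fine and only the output encoding is at issue, the pipeline can be rebuilt over an in-memory stream (BytesIO stands in for the terminal here; this is a sketch, not part of the answer above):

```python
import io

data = b"\xc3\xa9"
text = data.decode('utf-8')      # 'é'; the decode step never fails here
assert text == '\u00e9'

# print() would now have to *encode* text; with an ASCII sink that is
# where the UnicodeEncodeError comes from. A UTF-8 sink is fine:
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8')
out.write(text)
out.flush()
print(buf.getvalue())  # b'\xc3\xa9'
```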

chcp 65001 codepage results in program termination without any error

Problem
The problem arises when I want to input Unicode character in Python interpreter (for simplicity I have used a-umlaut in the example, but I have first encountered this for Farsi characters). Whenever I use python with chcp 65001 code page and then try to input even one Unicode character, Python exits without any error.
I have spent days trying to solve this problem, to no avail. But today I found a thread on the Python website, another on MySQL, and another on Lua-users in which issues were raised regarding this sudden exit, albeit without any solution, and with some saying that chcp 65001 is inherently broken.
It would be good to know once and for all whether this problem is chcp-design-related or there is a possible workaround.
Reproduce Error
chcp 65001
In Python 3.x (interactive shell):
print('ä')
result: it just exits the shell
however, this works: python.exe -c "print('ä')"
and so does: print('\u00e4')
result: ä
In LuaJIT 2.0.4:
print('ä')
result: it just exits the shell
however, this works: print('\xc3\xa4')
I have come up with these observations so far:
direct output within the command prompt works.
the Unicode-escape (or hex-escape) equivalent of the character works.
So
This is not a Python bug, and we can't use a Unicode character directly in CLI programs in the Windows command prompt or any of its wrappers like ConEmu or Cmder (I am using Cmder to be able to see and use Unicode characters in the Windows shell, and I have done so without any problem). Is this correct?
To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.
Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 codes written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.

Make python3 default to latin-1 for a script?

TL;DR: Can I make Python 3 use anything other than unicode as default encoding for everything?
I have some scripts written in Python 3. While operating on my own files, they worked fine, because the files were encoded in utf-8 and usually used only the ASCII-compatible subset anyway.
Now I tried using the same scripts on decades-old source files and I get unicode exceptions left and right. It is entirely possible that the files have been edited with editors assuming different encodings over the years, so the encoding of each file may differ or even be ill-defined.
If I had written my scripts in Python 2, which assumes a fixed-width encoding, everything would work fine. The parts using non-ascii characters are only in comments anyway.
In Python 3 the clean solution when encodings are unknown and possibly ill-defined would be to operate only on byte-array data, but the absence of a .format method on bytes and the need to distinguish between bytes and str literals everywhere make that both a syntactic nightmare and too time-consuming to fix across my scripts to be worthwhile.
Is it possible to change the assumed default encoding of sys.stdin, sys.stderr, and all files opened without explicit encoding to a fixed-width encoding? Doing so would allow my scripts to work as "bytes in, bytes out", which would really fit my use of shell scripts better (and would ultimately be more stable).
Ideally the solution should be possible on a per-script basis and allow ignoring environment variables.
The best I could come up with based on https://stackoverflow.com/a/12823030/2075630 is
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="latin-1")
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="latin-1")
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="latin-1")
# To avoid changing individual `open` calls:
open_ = open
def open(*a, **b):
    b["encoding"] = "latin-1"
    return open_(*a, **b)
but this causes the STDOUT and STDERR streams to be heavily buffered, which is undesirable for shell-scripts.
Python 2 doesn't assume any encoding. It basically operates on bytes. Read your files in binary mode and process bytes to go back to that mode.
You can treat the STDIO streams as binary by accessing the .buffer attribute:
bytes_from_stdin = sys.stdin.buffer.read()
sys.stdout.buffer.write(bytes_to_stdout)
Add 'b' to the file mode to open files in binary mode.
Normally the codec picked for STDIO encoding / decoding is based on the current locale of your terminal where you are running the script. To switch codecs, you can switch locale in your terminal, or set one for just Python by setting the PYTHONIOENCODING environment variable:
PYTHONIOENCODING=latin1 ./yourscript.py
Text files should always be opened using an explicit codec; don't rely on the system default. I'm not sure that patching out open() is the best path to do that though.
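If shadowing the builtin feels too invasive, a small helper gives the same convenience while leaving open() alone (the name open_latin1 and the filename below are mine, for illustration):

```python
import functools

# functools.partial pins the codec; all other arguments pass through
# to the real open() unchanged.
open_latin1 = functools.partial(open, encoding='latin-1')

with open_latin1('legacy_demo.txt', 'w') as f:
    f.write('caf\xe9')

with open('legacy_demo.txt', 'rb') as f:
    print(f.read())  # b'caf\xe9'
```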
The buffering issue with TextIOWrapper() can be remedied by enabling line buffering; an implicit buffer.flush() call is executed every time a \n newline is written to the wrapper if you set line_buffering=True:
sys.stdout = io.TextIOWrapper(
    sys.stdout.buffer, encoding="latin-1", line_buffering=True)
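The effect is observable with BytesIO standing in for sys.stdout.buffer: nothing has to reach the underlying stream until a newline goes through the wrapper, at which point the implicit flush happens.

```python
import io

buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='latin-1', line_buffering=True)

out.write('caf\xe9')        # held in the wrapper's internal buffer
out.write(' au lait\n')     # the newline triggers an implicit flush
print(buf.getvalue())       # b'caf\xe9 au lait\n'
```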
