Writing unicode strings via sys.stdout in Python

Assume for a moment that one cannot use print (and thus enjoy the benefit of automatic encoding detection). So that leaves us with sys.stdout. However, sys.stdout is so dumb as to not do any sensible encoding.
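To see the problem concretely (a sketch, not from the original post): in Python 2, writing a unicode string to a stdout that has no encoding attached (e.g. a pipe) falls back to the ascii codec and blows up:
import sys
# Fails when sys.stdout.encoding is None, e.g. when output is piped:
sys.stdout.write(u'caf\xe9')
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...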
Now one reads the Python wiki page PrintFails and goes to try out the following code:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);'
However this too does not work (at least on Mac). To see why:
>>> import locale
>>> locale.getpreferredencoding()
'mac-roman'
>>> sys.stdout.encoding
'UTF-8'
(UTF-8 is what one's terminal understands).
So one changes the above code to:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout);'
And now unicode strings are properly sent to sys.stdout and hence printed properly on the terminal (sys.stdout is attached to the terminal).
Is this the correct way to write unicode strings in sys.stdout or should I be doing something else?
EDIT: at times, say when piping the output to less, sys.stdout.encoding will be None. In this case, the above code will fail.
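One defensive variant (a sketch, not part of the original recipe) falls back to the locale's encoding when stdout reports none:
import codecs
import locale
import sys

# Use the terminal's encoding when known, otherwise the locale's.
encoding = sys.stdout.encoding or locale.getpreferredencoding()
sys.stdout = codecs.getwriter(encoding)(sys.stdout)
print u'caf\xe9'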

export PYTHONIOENCODING=utf-8
will do the job, but it can't be set from within Python itself...
What we can do is check whether it is set, and tell the user to set it before calling the script:
import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "please set PYTHONIOENCODING=UTF-8, e.g. export PYTHONIOENCODING=UTF-8, when writing to stdout."
        sys.exit(1)

Best idea is to check if you are directly connected to a terminal. If you are, use the terminal's encoding. Otherwise, use system preferred encoding.
import locale
import sys

if sys.stdout.isatty():
    # Attached to a terminal: trust the encoding it reports.
    default_encoding = sys.stdout.encoding
else:
    # Piped or redirected: fall back to the locale's preferred encoding.
    default_encoding = locale.getpreferredencoding()
It's also very important to always allow the user to specify whichever encoding she wants. Usually I make it a command-line option (like -e ENCODING), and parse it with the optparse module.
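A sketch of that option (the flag name and defaults are just illustrative):
import locale
import sys
from optparse import OptionParser

# Default to the terminal's encoding when attached to one, else the locale's.
if sys.stdout.isatty():
    default_encoding = sys.stdout.encoding
else:
    default_encoding = locale.getpreferredencoding()

parser = OptionParser()
parser.add_option("-e", "--encoding", default=default_encoding,
                  help="output encoding [default: %default]")
options, args = parser.parse_args()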
Another good thing is not to overwrite sys.stdout with an automatic encoder. Create your encoder and use it, but leave sys.stdout alone. You could import third-party libraries that write encoded bytestrings directly to sys.stdout.
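For example (a sketch): build a dedicated writer and keep sys.stdout pristine:
import codecs
import sys

out = codecs.getwriter('utf-8')(sys.stdout)
out.write(u'caf\xe9\n')            # our unicode goes through the encoder
sys.stdout.write('plain bytes\n')  # third-party byte output is unaffected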

There is an optional environment variable "PYTHONIOENCODING" which may be set to a desired default encoding. It would be one way of grabbing the user-desired encoding in a way consistent with all of Python. It is documented, somewhat buried, in the Python manual.
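For instance (a sketch; Python 2.6+ honours the variable):
$ PYTHONIOENCODING=utf-8 python -c 'import sys; print sys.stdout.encoding'
utf-8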

This is what I am doing in my application:
sys.stdout.write(s.encode('utf-8'))
This is the exact opposite fix, for reading UTF-8 names from argv:
for file in sys.argv[1:]:
    file = file.decode('utf-8')
This is very ugly (IMHO) as it forces you to work with UTF-8, which is the norm on Linux/Mac but not on Windows... Works for me anyway :)

It's not clear to me why you wouldn't be able to use print; but assuming so, yes, the approach looks right to me.

Related

How to print unicode to both terminal and file redirect

I read everything there is to read about Unicode, UTF-8, encoding/decoding and everything, but I still struggle.
I made a short example snippet to illustrate my problem.
I want to print the string 'Geïrriteerd' just like it is written here. I need to use the following code to make it print properly to a file when I run it with a redirect, like 'Test.py > output':
# coding=utf-8
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u'Geïrriteerd'
But if I do NOT redirect, the code above prints mojibake (the raw UTF-8 bytes rendered in the console's code page) to the terminal.
If I remove the 'codecs.getwriter' line, it prints fine again to the terminal but prints mojibake to the file.
How can I get this to print properly in both cases?
I am using Python 2.7 on Windows 10. I know Python 3.x handles unicode better in general, but I can't use that in my project (yet) due to other dependencies.
Since redirection is a shell operation, it makes sense to control the encoding using the shell as well. Fortunately, Python provides an environment variable to control the encoding. Given test.py:
#!python2
# coding=utf-8
print u'Geïrriteerd'
To redirect to a file with a particular encoding, use:
C:\>set PYTHONIOENCODING=utf8
C:\>test >out.txt
Running the script normally with PYTHONIOENCODING undefined will use the encoding of the terminal (in my case cp437):
C:\>set PYTHONIOENCODING=
C:\>test
Geïrriteerd
Your terminal is set up for cp850 instead of UTF-8.
Run chcp 65001.
http://enwp.org/Chcp_(command)
http://enwp.org/Windows_code_page#List
You need to "encode" your unicode first to write to file or display. You do not really need the codecs module.
The docs provide really good examples for working with unicode.
print type(u'Geïrriteerd')
print type(u'Geïrriteerd'.encode('utf-8'))
print u'Geïrriteerd'.encode('utf-8')
with open('test.txt', 'wb') as f:
    f.write(u'Geïrriteerd'.encode('utf-8'))

with open('test.txt', 'r') as f:
    content = f.read()
print content

# If you want to use codecs still
import codecs
with codecs.open("test.txt", "w", encoding="utf-8") as f:
    f.write(u'Geïrriteerd')

with open('test.txt', 'r') as f:
    content = f.read()
print content

Python 3 print() vs Python 2 print

While working on a buffer overflow exploit I found something really strange. I have successfully found that I need to provide 32 characters before the proper address I want to jump to and that the proper address is 0x08048a37. When I executed
python -c "print '-'*32+'\x37\x8a\x04\x08'" | ./MyExecutable
the exploit resulted in a success. But, when I tried:
python3 -c "print('-'*32+'\x37\x8a\x04\x08')" | ./MyExecutable
it didn't. The executable simply resulted in a Segmentation Fault without jumping to the desired address. In fact, executing
python -c "print '-'*32+'\x37\x8a\x04\x08'"
and
python3 -c "print('-'*32+'\x37\x8a\x04\x08')"
results in two different outputs on the console. The characters are, of course, not readable, but they're visually different.
I wonder why this is happening?
The Python 2 code writes bytes, the Python 3 code writes text that is then encoded to bytes. The latter will thus not write the same output; it depends on the codec configured for your pipe.
In Python 3, write bytes to the sys.stdout.buffer object instead:
python3 -c "import sys; sys.stdout.buffer.write(b'-'*32+b'\x37\x8a\x04\x08')"
You may want to manually add the \n newline that print would add.
sys.stdout is an io.TextIOBase object, encoding data written to it with a given codec (usually based on your locale, but when using a pipe, often defaulting to ASCII) before passing it on to the underlying buffer object. The TextIOBase.buffer attribute gives you direct access to the underlying BufferedIOBase object.
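To see the corruption concretely (a sketch, assuming a UTF-8 codec on the pipe): the lone byte '\x8a' is not ASCII, so Python 3 encodes it as the two-byte UTF-8 sequence C2 8A, shifting and breaking the payload:
s = '-' * 32 + '\x37\x8a\x04\x08'
print(s.encode('utf-8'))
# b'--------------------------------7\xc2\x8a\x04\x08'
#   note the extra \xc2: '\x8a' became two bytes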

How to write utf8 to standard output in a way that works with python2 and python3

I want to write a non-ASCII character, let's say →, to standard output. The tricky part seems to be that some of the data that I want to concatenate to that string is read from json. Consider the following simple json document:
{"foo":"bar"}
I include this because if I just want to print → then it seems enough to simply write:
print("→")
and it will do the right thing in both Python 2 and Python 3.
So I want to print the value of foo together with my non-ASCII character →. The only way I found to do this such that it works in both Python 2 and Python 3 is:
getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))
or
getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))
It is important not to miss the u in front of → because otherwise Python 2 will throw a UnicodeDecodeError.
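A quick sketch of that failure in Python 2: json gives back unicode, while → without the prefix is a UTF-8 byte string, so the concatenation triggers an implicit ascii decode:
import json
data = json.loads('{"foo":"bar"}')
data["foo"] + '\xe2\x86\x92'   # the bytes of → without the u prefix
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0 ...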
Using the print function like this:
print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))
doesn't seem to work because Python 3 will complain: TypeError: 'str' does not support the buffer interface.
Did I find the best way or is there a better option? Can I make the print function work?
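As to making print itself work: one approach that seems to work in both versions is to hand print a UTF-8 writer via file= (a sketch, reusing the getattr trick from above to reach the byte layer):
from __future__ import print_function
import codecs
import json
import sys

data = json.loads('{"foo":"bar"}')
# Wrap the byte layer (Python 3) or the byte stream itself (Python 2)
# in a UTF-8 writer and hand that to print via file=.
out = codecs.getwriter('utf-8')(getattr(sys.stdout, 'buffer', sys.stdout))
print(data["foo"] + u'\u2192', file=out)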
The most concise I could come up with is the following, which you may be able to make more concise with a few convenience functions (or even by replacing/overriding the print function):
# -*- coding=utf-8 -*-
import codecs
import os
import sys
# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')
if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(bytes(output.encode('utf-8')))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)
The best option is using the -*- coding line, which allows you to use the actual character in the file. But if, for some reason, you can't use the coding line, it's still possible to accomplish this without it.
This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1.
It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)
LANG=zh_CN python test.py
It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:
# -*- coding=utf-8 -*-
print("bar" + u"→")

Python pipe cp1252 string from PowerShell to a python (2.7) script

After a few days of poring over Stack Overflow and the Python 2.7 docs, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. And when I pipe the text to my python script and read it, it arrives with ? where characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", in that Python gets the text perfectly and I can then decode it to unicode etc. But apparently they don't let me do that on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys
def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0

if __name__ == "__main__":
    sys.exit(main())
Tried it with both the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
How about saving the data into a file and piping it to Python in a CMD session? Invoke PowerShell and Python from CMD. Like so,
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another idea: convert the data to base64 format in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess it isn't hard in Python either. Like so,
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
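The Python side of that idea might look like this (a sketch; it assumes the base64 text arrives on stdin):
import base64
import sys

b64 = sys.stdin.read().strip()
raw = base64.b64decode(b64)    # the original UTF-8 bytes from PowerShell
text = raw.decode('utf-8')     # u'\xe9\xea\xe8\xeb', i.e. u'éêèë'
print text.encode('utf-8')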

How to write binary data to stdout in python 3?

In python 2.x I could do this:
import sys, array
a = array.array('B', range(100))
a.tofile(sys.stdout)
Now however, I get a TypeError: can't write bytes to text stream. Is there some secret encoding that I should use?
A better way:
import sys
sys.stdout.buffer.write(b"some binary data")
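If the same code must also run under Python 2, where sys.stdout has no buffer attribute, the getattr fallback used earlier on this page seems to work (a sketch):
import sys

# Python 3: sys.stdout.buffer accepts bytes; Python 2: sys.stdout itself does.
out = getattr(sys.stdout, 'buffer', sys.stdout)
out.write(b"some binary data")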
An idiomatic way of doing so, which is only available for Python 3, is:
import os
import sys

with os.fdopen(sys.stdout.fileno(), "wb", closefd=False) as stdout:
    stdout.write(b"my bytes object")
    stdout.flush()
The good part is that it uses the normal file object interface, which everybody is used to in Python.
Notice that I'm setting closefd=False to avoid closing sys.stdout when exiting the with block. Otherwise, your program wouldn't be able to print to stdout anymore. However, for other kinds of file descriptors, you may want to skip that part.
import os
os.write(1, a.tostring())
or, os.write(sys.stdout.fileno(), …) if that's more readable than 1 for you.
In case you would like to specify an encoding in Python 3, you can still use the bytes constructor, as below:
import os
os.write(1, bytes('Your string to Stdout', 'UTF-8'))
where 1 is the usual file descriptor number for stdout --> sys.stdout.fileno()
Otherwise, if you don't care about the encoding, just use:
import sys
sys.stdout.write("Your string to Stdout\n")
If you want to use os.write without specifying an encoding, then use a bytes literal, as below:
import os
os.write(1,b"Your string to Stdout\n")
