How to write binary data to stdout in Python 3?

In Python 2.x I could do this:
import sys, array
a = array.array('B', range(100))
a.tofile(sys.stdout)
Now, however, I get a TypeError: can't write bytes to text stream. Is there some secret encoding that I should use?

A better way:
import sys
sys.stdout.buffer.write(b"some binary data")
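Applied to the question's array example, the same underlying buffer works with tofile (a minimal sketch, assuming Python 3):
import sys, array

a = array.array('B', range(100))
a.tofile(sys.stdout.buffer)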

An idiomatic way of doing so, which is only available for Python 3, is:
import os
import sys

with os.fdopen(sys.stdout.fileno(), "wb", closefd=False) as stdout:
    stdout.write(b"my bytes object")
    stdout.flush()
The good part is that it uses the normal file object interface, which everybody is used to in Python.
Notice that I'm setting closefd=False to avoid closing sys.stdout when exiting the with block. Otherwise, your program wouldn't be able to print to stdout anymore. However, for other kinds of file descriptors, you may want to skip that part.

import os
os.write(1, a.tobytes())  # array.tostring() was renamed tobytes() in Python 3
or, os.write(sys.stdout.fileno(), …) if that's more readable than 1 for you.

In case you would like to specify an encoding in Python 3, you can still use the bytes constructor, as below:
import os
os.write(1, bytes('Your string to Stdout', 'UTF-8'))
where 1 is the usual file descriptor number for stdout, i.e. sys.stdout.fileno().
Otherwise, if you don't care about the encoding, just use:
import sys
sys.stdout.write("Your string to Stdout\n")
If you want to use os.write without the encoding, then try the below:
import os
os.write(1, b"Your string to Stdout\n")

Related

How to append the Output of a command to temporary file

I have been trying to append the output of a command to a temporary file in Python and later do some operations on it. I am not able to append the data to the temporary file. Any help is appreciated! My sample code follows.
I am getting an error like this:
with open(temp1 , 'r') as f:
TypeError: expected str, bytes or os.PathLike object, not _TemporaryFileWrapper
import tempfile
import os

temp1 = tempfile.NamedTemporaryFile()
os.system("echo Hello world | tee temp1")
with open(temp1 , 'r') as f:
    a = f.readlines()[-1]
    print(a)
import tempfile
import os
# Open in text update mode ("w+"), so the str returned by os.popen().read() can be written without manual encoding
temp1 = tempfile.NamedTemporaryFile("w+")
# popen opens a pipe from the command, allowing one to capture its output
output = os.popen("echo Hello world")
# Write the command output to the temporary file
temp1.write(output.read())
# Reset the stream position at the beginning of the file, if you want to read its contents
temp1.seek(0)
print(temp1.read())
Check out subprocess.Popen for more powerful subprocess communication.
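For instance, a minimal sketch of capturing a command's output with Popen (the echo command is just an illustration):
import subprocess

# Start the command with its stdout connected to a pipe
p = subprocess.Popen(["echo", "Hello world"], stdout=subprocess.PIPE)
out, err = p.communicate()  # read the output and wait for the process to exit
print(out.decode())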
Whatever you're trying to do isn't right. It appears that you are trying to have a system call write to a file, and then you want to read that file in your Python code. You're creating a temporary file, but then your system call writes to a statically named file, 'temp1', rather than to the temporary file you've opened. So it's unclear whether you want/need a computed temporary file name or whether using 'temp1' is OK. The easiest way to fix your code to do what I think you want is this:
import os

os.system("echo Hello world | tee temp1")
with open('temp1', 'r') as f:
    a = f.readlines()[-1]
    print(a)
If you need to create a temporary file name in your situation, then you have to be careful if you are at all concerned about security or thread safety. What you really want to do is have the system create a temporary directory for you, and then create a statically named file in that directory. Here's your code reworked to do that:
import tempfile
import os

with tempfile.TemporaryDirectory() as tmpdir:
    temp_path = os.path.join(tmpdir, "temp1")  # named so it doesn't shadow the tempfile module
    os.system("echo Hello world > " + temp_path)
    with open(temp_path) as f:
        buf = f.read()
    print(buf)
This method has the added benefit of automatically cleaning up for you.
UPDATE: I have now seen @UlisseBordingnon's answer. That's a better solution overall. Using os.system() is discouraged. I would have gone a slightly different way, using the subprocess module, but what they suggest is 100% valid, and is thread- and security-safe. I guess I'll leave my answer here, as maybe you or other readers need to use os.system() or otherwise have the shell process you execute write directly to a file.
As others have suggested, you should use the subprocess module instead of os.system. From subprocess you can use the most recent interface, subprocess.run, which was added in Python 3.5.
The neat thing about using .run is that you can pass any file-like object to stdout and the stdout stream will automatically redirect to that file.
import tempfile
import subprocess

with tempfile.NamedTemporaryFile("w+") as f:
    subprocess.run(["echo", "hello world"], stdout=f)
    # command has finished running, let's check the file
    f.seek(0)
    print(f.read())
    # hello world
If you are using Python 3.7 or later (capture_output was added to subprocess.run in 3.7), this is better still, because you do not need a temporary file:
import subprocess
completed_process = subprocess.run(
    ["echo", "hello world"],
    capture_output=True,
    encoding="utf-8",
)
print(completed_process.stdout)
Notes
The capture_output parameter tells run() to save the output to the .stdout and .stderr attributes
The encoding parameter will convert the output from bytes to string
Depending on your needs, if you print your output, a quicker way, though maybe not exactly what you are looking for, is to redirect the output to a file at the command-line level.
Example(egfile.py):
import os
os.system("echo Hello world")
At command level you can simply do:
python egfile.py > file.txt
The output of the script will be redirected to the file instead of the screen.

How to print unicode to both terminal and file redirect

I have read everything there is to read about Unicode, UTF-8, encoding/decoding and everything, but I still struggle.
I made a short example snippet to illustrate my problem.
I want to print the string 'Geïrriteerd' just like it is written here. I need to use the following code to make it print properly to a file when I run the script with a redirect to a file, like 'Test.py > output':
# coding=utf-8
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u'Geïrriteerd'
But if I do NOT redirect, the code above prints 'GeÃ¯rriteerd' to the terminal.
If I remove the 'codecs.getwriter' line, it prints fine again to the terminal, but prints 'GeÃ¯rriteerd' to the file.
How can I get this to print properly in both cases?
I am using Python 2.7 on Windows 10. I know Python 3.x handles unicode better in general, but I can't use that in my project (yet) due to other dependencies.
Since redirection is a shell operation, it makes sense to control the encoding using the shell as well. Fortunately, Python provides an environment variable to control the encoding. Given test.py:
#!python2
# coding=utf-8
print u'Geïrriteerd'
To redirect to a file with a particular encoding, use:
C:\>set PYTHONIOENCODING=utf8
C:\>test >out.txt
Running the script normally with PYTHONIOENCODING undefined will use the encoding of the terminal (in my case cp437):
C:\>set PYTHONIOENCODING=
C:\>test
Geïrriteerd
Your terminal is set up for cp850 instead of UTF-8.
Run chcp 65001.
http://enwp.org/Chcp_(command)
http://enwp.org/Windows_code_page#List
You need to "encode" your unicode first to write to file or display. You do not really need the codecs module.
The docs provide really good examples for working with unicode.
print type(u'Geïrriteerd')
print type(u'Geïrriteerd'.encode('utf-8'))
print u'Geïrriteerd'.encode('utf-8')

with open('test.txt', 'wb') as f:
    f.write(u'Geïrriteerd'.encode('utf-8'))

with open('test.txt', 'r') as f:
    content = f.read()
    print content

# If you want to use codecs still
import codecs

with codecs.open("test.txt", "w", encoding="utf-8") as f:
    f.write(u'Geïrriteerd')

with open('test.txt', 'r') as f:
    content = f.read()
    print content

How am I supposed to handle the BOM while text processing using sys.stdin in Python 3?

Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first Chapter marker at the first line if it includes the ^ caret to require the regex match to be at the beginning of the line, because the BOM is consumed as the first character. (Example regex: ^Chapter).
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec, but discuss errors I never encounter and do not discuss the syntax for opening a file with the desired decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys

fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...
My question is what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this?)
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs

# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')

if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')

for line in fin:
    ...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to set PYTHONIOENCODING before invoking your script, like:
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout work with UTF-8-SIG, so your <outfile> will maintain the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything else to deal with this. As lenz suggested in the comments, just check for the BOM and throw it out:
for line in input:
    if line[0] == "\ufeff":
        line = line[1:]  # trim the BOM away
    # the rest of your code goes here as usual
In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure() (note that encoding is a keyword-only argument):
sys.stdin.reconfigure(encoding="utf-8-sig")
which should be called before any attempt to read standard input. This will decode the BOM, which will no longer appear when reading sys.stdin.
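A minimal sketch of the whole pattern, assuming Python 3.7+ (where reconfigure() was added):
import sys

# Re-decode stdin as UTF-8 with an optional BOM; must happen before any read
sys.stdin.reconfigure(encoding="utf-8-sig")

for line in sys.stdin:
    # the utf-8-sig codec has already stripped any leading BOM
    print(line, end="")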

Reading binary data from stdin

Is it possible to read stdin as binary data in Python 2.6? If so, how?
I see in the Python 3.1 documentation that this is fairly simple, but the facilities for doing this in 2.6 don't seem to be there.
If the methods described in 3.1 aren't available, is there a way to close stdin and reopen it in binary mode?
Just to be clear, I am using 'type' in an MS-DOS shell to pipe the contents of a binary file to my Python code. This should be the equivalent of the Unix 'cat' command, as far as I understand. But when I test this out, I always get one byte less than the expected file size.
The reason I'm going the Java/JAR/Jython route is because one of my main external libraries is only available as a Java JAR. But unfortunately, I had started my work as Python. It might have been easier to convert my code over to Java a while ago, but since this stuff was all supposed to be compatible, I figured I would try trucking through it and prove it could be done.
In case anyone was wondering, this is also related to this question I asked a few days ago.
Some of this was answered in that question, so I'll try to update my original question with some notes on what I have figured out so far.
From the docs (see here):
The standard streams are in text mode by default. To write or read binary data to these, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
But, as in the accepted answer, invoking Python with -u is another option; it forces stdin, stdout and stderr to be totally unbuffered. See the python(1) manpage for details.
See the documentation on io for more information on text buffering, and use sys.stdin.detach() to get at the underlying binary buffer from within Python.
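For example, a minimal sketch of switching to raw bytes via detach() (note that after detaching, sys.stdin itself must no longer be used):
import sys

# detach() returns the underlying binary buffer and invalidates sys.stdin
binary_stdin = sys.stdin.detach()
data = binary_stdin.read()
sys.stdout.buffer.write(data)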
Here is the final cut for Linux/Windows Python 2/3 compatible code to read data from stdin without corruption:
import sys

PY3K = sys.version_info >= (3, 0)

if PY3K:
    source = sys.stdin.buffer
else:
    # Python 2 on Windows opens sys.stdin in text mode, and
    # binary data that is read from it becomes corrupted on \r\n
    if sys.platform == "win32":
        # set sys.stdin to binary mode
        import os, msvcrt
        msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
    source = sys.stdin

b = source.read()
Use the -u command line switch to force Python 2 to treat stdin, stdout and stderr as binary unbuffered streams.
C:> type mydoc.txt | python.exe -u myscript.py
If you still need this...
This is a simple test I've used to read a binary file that contains the 0x1A character in between:
import os, sys, msvcrt
msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
s = sys.stdin.read()
print len(s)
My test file data was:
0x23, 0x1A, 0x45
Without setting stdin to binary mode this test prints 1, since it treats 0x1A as EOF.
Of course it works on Windows only, because it depends on the msvcrt module.
You can perform an unbuffered read with:
os.read(0, bytes_to_read)
with 0 being the file descriptor for stdin
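A minimal sketch of draining stdin in fixed-size chunks this way (the 4096-byte chunk size is an arbitrary choice):
import os

chunks = []
while True:
    chunk = os.read(0, 4096)  # os.read returns b'' at end of stream
    if not chunk:
        break
    chunks.append(chunk)
data = b"".join(chunks)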
import sys
data = sys.stdin.read(10) # Read 10 bytes from stdin
If you need to interpret binary data, use the struct module.
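For instance, a minimal sketch of unpacking a little-endian 4-byte unsigned integer from the bytes read; the field layout is purely illustrative:
import struct
import sys

data = sys.stdin.buffer.read(4)  # Python 3; on Python 2, plain sys.stdin.read(4)
(value,) = struct.unpack("<I", data)  # one little-endian unsigned 32-bit integer
print(value)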

Writing unicode strings via sys.stdout in Python

Assume for a moment that one cannot use print (and thus enjoy the benefit of automatic encoding detection). So that leaves us with sys.stdout. However, sys.stdout is so dumb as to not do any sensible encoding.
Now one reads the Python wiki page PrintFails and goes to try out the following code:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);'
However, this too does not work (at least on Mac). To see why:
>>> import locale
>>> locale.getpreferredencoding()
'mac-roman'
>>> sys.stdout.encoding
'UTF-8'
(UTF-8 is what one's terminal understands).
So one changes the above code to:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout);'
And now unicode strings are properly sent to sys.stdout and hence printed properly on the terminal (sys.stdout is attached to the terminal).
Is this the correct way to write unicode strings in sys.stdout or should I be doing something else?
EDIT: at times (say, when piping the output to less) sys.stdout.encoding will be None. In this case, the above code will fail.
export PYTHONIOENCODING=utf-8
will do the job, but it can't be set from within Python itself...
What we can do is verify that it is set, and tell the user to set it before calling the script:
import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "please set the environment variable PYTHONIOENCODING=UTF-8 (e.g. export PYTHONIOENCODING=UTF-8) before writing to stdout"
        exit(1)
Best idea is to check if you are directly connected to a terminal. If you are, use the terminal's encoding. Otherwise, use system preferred encoding.
import locale
import sys

if sys.stdout.isatty():
    default_encoding = sys.stdout.encoding
else:
    default_encoding = locale.getpreferredencoding()
It's also very important to always allow the user to specify whichever encoding she wants. Usually I make it a command-line option (like -e ENCODING), and parse it with the optparse module.
Another good thing is to not overwrite sys.stdout with an automatic encoder. Create your encoder and use it, but leave sys.stdout alone. You could import 3rd party libraries that write encoded bytestrings directly to sys.stdout.
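A minimal sketch of both suggestions together, assuming Python 2 (the -e option and the sample string are illustrative; codecs.getwriter wraps sys.stdout without replacing it):
# -*- coding: utf-8 -*-
import codecs
import locale
import sys
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-e", "--encoding", dest="encoding",
                  default=locale.getpreferredencoding(),
                  help="encoding to use for output")
options, args = parser.parse_args()

# Build an encoding writer around stdout, but leave sys.stdout itself alone
out = codecs.getwriter(options.encoding)(sys.stdout)
out.write(u'Geïrriteerd\n')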
There is an optional environment variable "PYTHONIOENCODING" which may be set to a desired default encoding. It would be one way of grabbing the user-desired encoding in a way consistent with all of Python. It is buried in the Python manual here.
This is what I am doing in my application:
sys.stdout.write(s.encode('utf-8'))
This is the exact opposite of the fix for reading UTF-8 names from argv:
for file in sys.argv[1:]:
    file = file.decode('utf-8')
This is very ugly (IMHO), as it forces you to work with UTF-8, which is the norm on Linux/Mac but not on Windows... Works for me anyway :)
It's not clear to me why you wouldn't be able to use print; but assuming so, yes, the approach looks right to me.
