Is it possible to read stdin as binary data in Python 2.6? If so, how?
I see in the Python 3.1 documentation that this is fairly simple, but the facilities for doing this in 2.6 don't seem to be there.
If the methods described in 3.1 aren't available, is there a way to close stdin and reopen it in binary mode?
Just to be clear, I am using 'type' in a MS-DOS shell to pipe the contents of a binary file to my python code. This should be the equivalent of a Unix 'cat' command, as far as I understand. But when I test this out, I always get one byte less than the expected file size.
The reason I'm going the Java/JAR/Jython route is that one of my main external libraries is only available as a Java JAR. But unfortunately, I had started my work in Python. It might have been easier to convert my code over to Java a while ago, but since this stuff was all supposed to be compatible, I figured I would try trucking through it and prove it could be done.
In case anyone was wondering, this is also related to this question I asked a few days ago.
Some of it was answered in this question.
So I'll try to update my original question with some notes on what I have figured out so far.
From the docs (see here):
The standard streams are in text mode by default. To write or read binary data to these, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
But, as in the accepted answer, invoking python with the -u switch is another option, which forces stdin, stdout and stderr to be totally unbuffered. See the python(1) manpage for details.
See the documentation on io for more information on text buffering, and use sys.stdin.detach() to disable buffering from within Python.
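For example, a minimal Python 3 sketch of the detach() approach:

import sys

# Python 3: detach() hands back the underlying binary stream;
# the original text wrapper is unusable afterwards
binary_stdin = sys.stdin.detach()
data = binary_stdin.read()  # returns bytes, not str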
Here is the final cut for Linux/Windows Python 2/3 compatible code to read data from stdin without corruption:
import sys

PY3K = sys.version_info >= (3, 0)

if PY3K:
    source = sys.stdin.buffer
else:
    # Python 2 on Windows opens sys.stdin in text mode, and
    # binary data read from it becomes corrupted on \r\n
    if sys.platform == "win32":
        # set sys.stdin to binary mode
        import os, msvcrt
        msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
    source = sys.stdin

b = source.read()
Use the -u command line switch to force Python 2 to treat stdin, stdout and stderr as binary unbuffered streams.
C:> type mydoc.txt | python.exe -u myscript.py
If you still need this...
Here is a simple test I've used to read a binary file that contains a 0x1A character in the middle:
import os, sys, msvcrt
msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
s = sys.stdin.read()
print len(s)
My test file data was:
0x23, 0x1A, 0x45
Without setting stdin to binary mode, this test prints 1, since it treats 0x1A as EOF.
Of course, it works on Windows only, because it depends on the msvcrt module.
You can perform an unbuffered read with:
os.read(0, bytes_to_read)
with 0 being the file descriptor for stdin.
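For instance, a small sketch (the 1024-byte chunk size is an arbitrary choice):

import os

# Read up to 1024 bytes straight from file descriptor 0 (stdin);
# returns b'' at EOF
chunk = os.read(0, 1024)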
import sys
data = sys.stdin.read(10) # Read 10 bytes from stdin
If you need to interpret binary data, use the struct module.
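For example, a hedged sketch; the layout (a little-endian 4-byte int followed by a 4-byte float) is made up for illustration:

import struct
import sys

raw = sys.stdin.read(8)
# '<if' = little-endian: 4-byte int, then 4-byte float
number, value = struct.unpack('<if', raw)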
Related
I'm looking for a way to monitor a file that is written to by a program on Linux. I found the tail -F command in here, and also recommended was less +FG. I tested it by running tail -F file in one terminal, and a simple python script:
import time

for i in range(20):
    print i
    time.sleep(0.5)
in another. I redirected the output to the file:
python script.py >> file
I expected that tail would track the file contents and update the display at fixed intervals; instead, it only shows what was written to the file after the command terminates.
The same thing happens with less +FG and also if I watch the output from cat. I've also tried the usual redirect, > (which truncates the file), instead of >>. Here it says the file was truncated, but it still does not track it in real time.
Any idea why this doesn't work? (It's suggested here that it might be due to buffered writes, but since my script runs for over 10 seconds, I suspect this might not be the cause.)
Edit: In case it matters, I'm running Linux Mint 18.1
Python's standard output is buffered. If you only see all the output when you close the script / the script is done, that's definitely a buffering issue.
You can use this instead:
import time
import sys
for i in range(20):
    sys.stdout.write('%d\n' % i)
    sys.stdout.flush()
    time.sleep(0.5)
I've tested it and it prints values in real time. To overcome the buffering issue, I call .flush() after each .write() to force the buffer to be flushed.
Additional options from the comments:
Use the original print statement with sys.stdout.flush() afterwards
Run the python script with python -u for unbuffered binary stdout and stderr
Regarding jon1467's answer (sorry, I can't comment on your answer), your understanding of redirection is wrong.
Try this :
dd if=/dev/urandom > test.txt
while watching the file size with:
ls -l test.txt
You'll see the file grow while dd is running.
Vinny's answer is correct, python standard output is buffered.
The more common way to deal with the "buffering effect" you noticed is to flush stdout, as Vinny showed you.
You could also use the -u option to disable buffering for the whole Python process, or you could just reopen standard output with a buffer size of 0, as below (in Python 2 at least):
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
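In context, a minimal Python 2 sketch, reusing the loop from the question:

import os
import sys
import time

# Reopen stdout with buffer size 0 so every write hits the
# file or terminal immediately (Python 2 only)
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

for i in range(20):
    print i
    time.sleep(0.5)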
Can someone explain stdin and stdout? I don't understand what is the difference between using these two objects for user input and output as opposed to print and raw_input. Perhaps there is some vital information I am missing. Can anyone explain?
stdin and stdout are the streams for your operating system's standard input and output.
You use them to read and write data from your operating system's std input (usually keyboard) and output (your screen, the python console, or such).
print is simply a function which writes to the operating system's stdout and adds a newline to the end.
There are more features in print than just this, but that's the basic idea.
# Simplified print implementation
from sys import stdout

def print(value, end='\n'):
    stdout.write(value)
    stdout.write(end)
stdin and stdout are stream representations of the standard in- and output that your OS supplies Python with.
You can do almost everything you can do with a file on these, so for many applications they are far more useful than, e.g., print, which adds line breaks etc.
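For instance, here is a rough Python 2 equivalent of raw_input() built directly on the streams (the prompt text is just an example):

import sys

# Write the prompt without a trailing newline, then read one line
sys.stdout.write('Enter your name: ')
sys.stdout.flush()
name = sys.stdin.readline().rstrip('\n')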
I was wondering if there was a way to run a command line executable in python, but pass it the argument values from memory, without having to write the memory data into a temporary file on disk. From what I have seen, it seems to that the subprocess.Popen(args) is the preferred way to run programs from inside python scripts.
For example, I have a pdf file in memory. I want to convert it to text using the commandline function pdftotext which is present in most linux distros. But I would prefer not to write the in-memory pdf file to a temporary file on disk.
pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??]) <- what is the value of ??
What is the method I should call, and how should I pipe in-memory data into its standard input and pipe its output back into another variable in memory?
I am guessing there are other pdf modules that can do the conversion in memory and information about those modules would be helpful. But for future reference, I am also interested about how to pipe input and output to the commandline from inside python.
Any help would be much appreciated.
with Popen.communicate:
import subprocess

# stdin must be a pipe for communicate() to send pdf_data to it
p = subprocess.Popen(["pdftotext", "-", "-"],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = p.communicate(pdf_data)
os.tmpfile is useful if you need a seekable object. It uses a file, but it's nearly as simple as the pipe approach, with no need for cleanup.
tf = os.tmpfile()
tf.write(...)
tf.seek(0)
subprocess.Popen(..., stdin=tf)
This may not work on the POSIX-impaired OS "Windows".
Popen.communicate from subprocess takes an input parameter that is used to send data to stdin, you can use that to input your data. You also get the output of your program from communicate, so you don't have to write it into a file.
The documentation for communicate explicitly warns that everything is buffered in memory, which seems to be exactly what you want to achieve.
Assume for a moment that one cannot use print (and thus enjoy the benefit of automatic encoding detection). So that leaves us with sys.stdout. However, sys.stdout is so dumb as to not do any sensible encoding.
Now one reads the Python wiki page PrintFails and goes to try out the following code:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);
However, this too does not work (at least on Mac). To see why:
>>> import locale
>>> locale.getpreferredencoding()
'mac-roman'
>>> sys.stdout.encoding
'UTF-8'
(UTF-8 is what one's terminal understands).
So one changes the above code to:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout);
And now unicode strings are properly sent to sys.stdout and hence printed properly on the terminal (sys.stdout is attached to the terminal).
Is this the correct way to write unicode strings in sys.stdout or should I be doing something else?
EDIT: at times (say, when piping the output to less) sys.stdout.encoding will be None. In this case, the above code will fail.
export PYTHONIOENCODING=utf-8
will do the job, but it can't be set from within Python itself ...
What we can do is verify that it isn't set and tell the user to set it before calling the script:
import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "Please set the environment variable PYTHONIOENCODING=UTF-8 (example: export PYTHONIOENCODING=UTF-8) before writing to stdout."
        exit(1)
Best idea is to check if you are directly connected to a terminal. If you are, use the terminal's encoding. Otherwise, use system preferred encoding.
import locale
import sys

if sys.stdout.isatty():
    default_encoding = sys.stdout.encoding
else:
    default_encoding = locale.getpreferredencoding()
It's also very important to always allow the user to specify whichever encoding she wants. Usually I make it a command-line option (like -e ENCODING) and parse it with the optparse module.
Another good thing is to not overwrite sys.stdout with an automatic encoder. Create your encoder and use it, but leave sys.stdout alone. You could import 3rd party libraries that write encoded bytestrings directly to sys.stdout.
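A minimal sketch combining these suggestions (the encoding choice mirrors the isatty() check above):

import codecs
import locale
import sys

if sys.stdout.isatty():
    encoding = sys.stdout.encoding
else:
    encoding = locale.getpreferredencoding()

# Wrap stdout in an encoder without touching sys.stdout itself
writer = codecs.getwriter(encoding)(sys.stdout)
writer.write(u'caf\xe9\n')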
There is an optional environment variable "PYTHONIOENCODING" which may be set to a desired default encoding. It would be one way of grabbing the user-desired encoding in a way consistent with all of Python. It is buried in the Python manual here.
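Note that PYTHONIOENCODING only takes effect if it is set before the interpreter starts; from inside a script you can merely inspect it, as in this small sketch:

import os

# None means the variable was not set when Python started
io_encoding = os.environ.get('PYTHONIOENCODING')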
This is what I am doing in my application:
sys.stdout.write(s.encode('utf-8'))
The exact opposite fix applies when reading UTF-8 names from argv:

import sys

for file in sys.argv[1:]:
    file = file.decode('utf-8')
This is very ugly (IMHO), as it forces you to work with UTF-8, which is the norm on Linux/Mac but not on Windows... Works for me anyway :)
It's not clear to me why you wouldn't be able to use print; but assuming so, yes, the approach looks right to me.
In python 2.x I could do this:
import sys, array
a = array.array('B', range(100))
a.tofile(sys.stdout)
Now however, I get a TypeError: can't write bytes to text stream. Is there some secret encoding that I should use?
A better way:
import sys
sys.stdout.buffer.write(b"some binary data")
An idiomatic way of doing so, which is only available for Python 3, is:
import os
import sys

with os.fdopen(sys.stdout.fileno(), "wb", closefd=False) as stdout:
    stdout.write(b"my bytes object")
    stdout.flush()
The good part is that it uses the normal file object interface, which everybody is used to in Python.
Notice that I'm setting closefd=False to avoid closing sys.stdout when exiting the with block. Otherwise, your program wouldn't be able to print to stdout anymore. However, for other kinds of file descriptors, you may want to skip that part.
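For example, for a descriptor you own rather than stdout, the default closefd=True is usually what you want ('dump.bin' is a hypothetical file name):

import os

fd = os.open('dump.bin', os.O_WRONLY | os.O_CREAT)
# Default closefd=True: the descriptor is closed with the block
with os.fdopen(fd, 'wb') as out:
    out.write(b'my bytes object')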
import os
os.write(1, a.tostring())
or, os.write(sys.stdout.fileno(), …) if that's more readable than 1 for you.
In case you would like to specify an encoding, in Python 3 you can still use the bytes function as below:
import os
os.write(1, bytes('Your string to Stdout', 'UTF-8'))
where 1 is the usual file descriptor number for stdout (sys.stdout.fileno()).
Otherwise, if you don't care about the encoding, just use:
import sys
sys.stdout.write("Your string to Stdout\n")
If you want to use os.write without the encoding, then use:
import os
os.write(1,b"Your string to Stdout\n")