My Python program needs to multiplex reads from several different file descriptors. Some of them are the stdout/stderr descriptors of subprocesses; others are the file descriptors associated with inotify calls.
My problem is being able to do a "non-blocking"[1] read after select(). According to the documentation, sockets that select() reports to be ready for writes "are guaranteed to not block on a write of up to PIPE_BUF bytes".
I suppose that no such guarantee makes sense for a read: select() reporting that there is data waiting in the kernel pipe buffer doesn't mean you can go ahead and .read(select.PIPE_BUF), as there could be just a few bytes in there.
This means that when I'm calling read() on the socket, I can get what is effectively a deadlock as some of the subprocesses produce output very rarely.
Is there any way around this? My current workaround is to call readline() on it, and I'm lucky enough that everything I'm reading from has line-by-line output. Is select() of any use at all when reading from a pipe like this, seeing as there's no way to know how many bytes you can safely read without blocking?
[1] I'm aware that this is distinct from an O_NONBLOCK socket
It's OK to go ahead and read each pipe or socket that select() reports as readable: you'll get whatever data are available at that moment, even if it's fewer bytes than you asked for:
>>> import os
>>> desc = os.pipe()
>>> desc
(3, 4)
>>> os.write(desc[1], 'foo')
3
>>> os.read(desc[0], 100)
'foo'
>>> os.read(desc[0], 100)
[hangs here as there's no input available, interrupt with ^C]
...
KeyboardInterrupt
>>> os.write(desc[1], 'a')
1
>>> os.read(desc[0], 100)
'a'
>>>
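For the multiplexing case in the question, that means something like the following sketch (the handle callback and the list of descriptors are placeholders, not part of the question's code):

import os
import select

def multiplex(fds, handle, chunk_size=4096):
    """Read from several file descriptors as data becomes available.

    `fds` is a list of integer file descriptors (subprocess stdout/stderr,
    inotify fds, ...); `handle(fd, data)` is a caller-supplied callback.
    """
    fds = list(fds)
    while fds:
        readable, _, _ = select.select(fds, [], [])
        for fd in readable:
            # os.read() returns *up to* chunk_size bytes: whatever is in the
            # kernel buffer right now, so it does not block after select().
            data = os.read(fd, chunk_size)
            if not data:          # an empty read means EOF on this descriptor
                fds.remove(fd)
            else:
                handle(fd, data)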
Just as an alternative, I ran into exactly the same problem and solved it by using readline(1) and appending that to an internal buffer until readline returned a character that I was interested in tokenizing on (newline, space, etc.).
More detail: I called select() on a file descriptor and then called readline(1) on any file descriptor that was returned by select, appended that char to a buffer, and repeated until readline returned what I wanted. Then I returned my buffer, cleared it and moved on. Incidentally, I also returned a Boolean that let the calling method know whether the data I was returning was empty because of a bad read or just because it wasn't done.
I also implemented a version that would tokenize on a timeout. If I'd been buffering for x ms without finding a newline or EOF, go ahead and return the buffer.
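Roughly, the buffering helper looked something like this (an untested sketch with illustrative names; it assumes the stream is an unbuffered binary pipe, e.g. from Popen(..., bufsize=0), so that select() on it stays accurate):

import select

def read_token(stream, delimiters=(b"\n", b" ")):
    """Accumulate bytes until a delimiter or EOF is seen.

    Returns (token, complete): complete is False when EOF arrived before a
    delimiter, mirroring the Boolean flag described above.
    """
    buf = b""
    while True:
        select.select([stream], [], [])  # block until at least one byte is ready
        ch = stream.readline(1)          # read a single byte
        if not ch:                       # EOF
            return buf, False
        if ch in delimiters:             # delimiter found: token is complete
            return buf, True
        buf += ch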
I'm currently trying to find out if there's a way to ask a file descriptor how many bytes it has waiting to be read, then just readline([that many bytes])...
Hope that helps.
I want to write a command-line program that communicates with other interactive programs through a pseudo-terminal. In particular, I want to be able to conditionally forward received keystrokes to the underlying process. Let's say, for example, that I would like to silently ignore any "e" characters that are sent.
I know that Python has a pty module for working with pseudo-terminals and I have a basic version of my program working using it:
import os
import pty

def script_read(stdin):
    data = os.read(stdin, 1024)
    if data == b"e":
        return ...  # What goes here?
    return data

pty.spawn(["bash"], script_read)
From experimenting, I know that returning an empty bytes object b"" causes the pty.spawn implementation to think that the underlying file descriptor has reached the end of file and should no longer be read from, which causes the terminal to become totally unresponsive (I had to kill my terminal emulator!).
For interactive use, the simplest way to do this is probably to just return a bytes object containing a single null byte: b"\0". The terminal emulator will not print anything for it and so it will look like that input is just completely ignored.
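Concretely, a minimal sketch mirroring the question's setup, with only the return value changed:

import os
import pty

def script_read(stdin):
    data = os.read(stdin, 1024)
    if data == b"e":
        # A single NUL byte is non-empty, so pty.spawn does not treat it as
        # end-of-file, and the terminal displays nothing for it.
        return b"\0"
    return data

pty.spawn(["bash"], script_read)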
This probably isn't great for certain usages of pseudo-terminals. In particular, if the content written to the pseudo-terminal is going to be written again by the attached program, this would probably cause random null bytes to appear in the file. Testing with cat as the attached program, the sequence ^@ is printed to the terminal whenever a null byte is sent to it.
So, YMMV.
A more proper solution would be to create a wrapper type that can masquerade as an empty string for the purposes of os.write but that would evaluate as "truthy" in a boolean context to not trigger the end of file conditional. I did some experimenting with this and couldn't figure out what needs to be faked to make os.write fully accept the wrapper as a string type. I'm unclear if it's even possible. :(
Here's my initial attempt at creating such a wrapper type:
class EmptyBytes():
    def __init__(self):
        self.sliced = False

    def __class__(self):
        return type(b"")

    def __getitem__(self, _key):
        return b""
According to Tim Peters, "There should be one-- and preferably only one --obvious way to do it." In Python, there appear to be three ways to print information:
print('Hello World', end='')
sys.stdout.write('Hello World')
os.write(1, b'Hello World')
Question: Are there best-practice policies that state when each of these three different methods of printing should be used in a program?
Note that Tim's statement is perfectly correct: there is only one obvious way to do it: print().
The other two possibilities that you mention have different goals.
If we want to summarize the goals of the three alternatives:
print is the high-level function that allows you to write something to stdout (or another file). It provides a simple and readable API, with some fancy options about how the individual items are separated, or whether you want to add a terminator or not, etc. This is what you want to do most of the time.
sys.stdout.write is just a method of file objects. The real point of sys.stdout is that you can pass it around as if it were any other file. This is useful when you have to deal with a function that expects a file and you want it to print the text directly to stdout.
In other words you shouldn't use sys.stdout.write at all. You just pass around sys.stdout to code that expects a file.
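For example (an illustrative sketch, not part of the original answer): json.dump expects a file-like object and will happily write to sys.stdout or to an open file on disk:

import json
import sys

def dump_report(data, out):
    # `out` can be any file-like object with a write() method.
    json.dump(data, out, indent=2)
    out.write("\n")

dump_report({"status": "ok"}, sys.stdout)       # print to the terminal
with open("report.json", "w") as f:
    dump_report({"status": "ok"}, f)            # or write to a file on disk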
Note: in Python 2 there were some situations where using the print statement produced worse code than calling sys.stdout.write. However, the print function lets you define the separator and terminator and thus avoids almost all of these corner cases.
os.write is a low-level call to write to a file. You must manually encode the contents and you also have to pass the file descriptor explicitly. This is meant only for low-level code that, for some reason, cannot be implemented on top of the higher-level interfaces. You almost never want to call it directly, because it's not required and has a worse API than the rest.
Note that if you have code that should write things to a file, it's better to do:
my_file.write(a)
# ...
my_file.write(b)
# ...
my_file.write(c)
Than:
print(a, file=my_file)
# ...
print(b, file=my_file)
# ...
print(c, file=my_file)
Because it's more DRY. Using print you have to repeat file= every time. This is fine if you only write in one place in the code, but with five or six different writes it is much easier to simply call the write method directly.
To me print is the right way to print to stdout, but:
There is a good reason why sys.stdout.write exists. Imagine a class which generates some text output, and you want to make it write to either stdout, a file on disk, or a string. Ideally the class really shouldn't care what output type it is writing to. The class can simply be given a file object, and so long as that object supports the write method, the class can use the write method to output the text.
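A small illustrative class along those lines (names are made up), writing to stdout or capturing to a string via io.StringIO:

import io
import sys

class ReportWriter:
    """Writes its output to any object that has a write() method."""
    def __init__(self, out):
        self.out = out

    def emit(self, line):
        self.out.write(line + "\n")

ReportWriter(sys.stdout).emit("to the terminal")

buf = io.StringIO()
ReportWriter(buf).emit("captured in a string")
print(buf.getvalue())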
Two of these methods require importing entire modules. Based on this alone, print() is the best standard use option.
sys.stdout is useful whenever stdout may change. This gives quite a bit of power for stream handling.
os.write is useful for OS-specific writing tasks (non-blocking writes, for instance)
This question has been asked a number of times on this site for sys.stdout vs. print:
Python - The difference between sys.stdout.write and print
print() vs sys.stdout.write(): which and why?
One example of using os.write is non-blocking file writes, demonstrated in the question below. The function may only be useful for this on some OSes, but it remains portable even when certain OSes don't support these different/special behaviors.
How to write to a file using non blocking IO?
Copying a file in Python using the straightforward approach typically looks like this:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
(This code snippet is from shutil.py, by the way).
Unfortunately, this has drawbacks in my special use-case (involving threading and very large buffers) [Italics part added later]. First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
To avoid this I'm using the file.readinto() method which, unfortunately, is documented as deprecated and "don't use":
import array

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)
        if count == 0:
            break
        if count != len(buffer):
            # short read: only write out the bytes that were actually read
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)
My solution works, but there are two drawbacks as well: First, readinto() is not to be used. It might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which also is unnecessarily expensive).
I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (reads, then throws EOFError and doesn't hand out the number of processed items). Also it is no solution for the ending special-case problem.
Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
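(For what it's worth, here is a rough sketch of one such buffer-reuse approach using the io layer, where readinto() is not deprecated: a bytearray plus a memoryview avoids both the reallocation and the special case for the last block. The function name is illustrative.)

def copyfileobj_reuse(fsrc, fdst, length=16*1024):
    """Copy fsrc to fdst, reusing one preallocated buffer (binary files)."""
    buf = bytearray(length)
    view = memoryview(buf)
    while True:
        count = fsrc.readinto(buf)   # fills buf in place, returns bytes read
        if not count:                # zero means EOF
            break
        # Slicing the memoryview writes only the bytes actually read, without
        # copying, so the short final block needs no special case.
        fdst.write(view[:count])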
This code snippet is from shutil.py
Which is a standard library module. Why not just use it?
First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
This is tiny compared to the effort required to actually grab a page of data from disk.
Normal Python code would not need such tweaks as this. However, if you really need all that performance tweaking to read files from inside Python code (as in, you are rewriting, for performance or memory usage, some server code you wrote that already works), I'd rather call the OS directly using ctypes, thus having the copy performed at as low a level as I want.
It may even be possible that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS- and filesystem-level optimizations for you).
I'm trying to use a Unix named pipe to output statistics of a running service. I intend to provide an interface similar to /proc, where one can see live stats by catting a file.
I'm using code similar to this in my Python program:
while True:
    f = open('/tmp/readstatshere', 'w')
    f.write('some interesting stats\n')
    f.close()
/tmp/readstatshere is a named pipe created by mknod.
I then cat it to see the stats:
$ cat /tmp/readstatshere
some interesting stats
It works fine most of the time. However, if I cat the entry several times in quick succession, sometimes I get multiple lines of some interesting stats instead of one. Once or twice it has even gone into an infinite loop, printing that line forever until I killed it. The only fix that I've got so far is to put a delay of, say, 500 ms after f.close() to prevent this issue.
I'd like to know why exactly this happens and if there is a better way of dealing with it.
Thanks in advance
A pipe is simply the wrong solution here. If you want to present a consistent snapshot of the internal state of your process, write that to a temporary file and then rename it to the "public" name. This will prevent all issues that can arise from other processes reading the state while you're updating it. Also, do NOT do that in a busy loop, but ideally in a thread that sleeps for at least one second between updates.
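A rough sketch of that approach (render_stats is a placeholder for whatever produces the stats text; os.rename is atomic on POSIX when source and destination are on the same filesystem):

import os
import tempfile
import time

def publish_stats(render_stats, public_path='/tmp/readstatshere'):
    while True:
        # Write the snapshot to a temporary file in the same directory...
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(public_path))
        with os.fdopen(fd, 'w') as f:
            f.write(render_stats())   # placeholder callback returning a string
        # ...then atomically move it into place, so readers always see a
        # complete, consistent snapshot.
        os.rename(tmp_path, public_path)
        time.sleep(1)   # don't busy-loop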
What about a UNIX socket instead of a pipe?
In this case, you can react on each connect by providing fresh data just in time.
The only downside is that you cannot cat the data; you'll have to create a new socket handle and connect() to the socket file.
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

try:
    os.unlink(MYSOCKETFILE)
except OSError: pass

s = socket.socket(socket.AF_UNIX)
s.bind(MYSOCKETFILE)
s.listen(10)

while True:
    s2, peeraddr = s.accept()
    s2.send('These are my actual data')
    s2.close()
Program querying this socket:
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

s = socket.socket(socket.AF_UNIX)
s.connect(MYSOCKETFILE)
while True:
    d = s.recv(100)
    if not d: break
    print d
s.close()
I think you should use FUSE.
It has Python bindings; see http://pypi.python.org/pypi/fuse-python/
This allows you to compose answers to questions formulated as POSIX filesystem system calls.
Don't write to an actual file. That's not what /proc does. Procfs presents a virtual (non-disk-backed) filesystem which produces the information you want on demand. You can do the same thing, but it'll be easier if it's not tied to the filesystem. Instead, just run a web service inside your Python program, and keep your statistics in memory. When a request comes in for the stats, formulate them into a nice string and return them. Most of the time you won't need to waste cycles updating a file which may not even be read before the next update.
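A minimal sketch of that idea using Python 3's http.server (the names and the port are illustrative):

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

STATS = {'requests_served': 0}   # illustrative in-memory statistics

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = repr(STATS).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve_stats(port=8181):
    server = HTTPServer(('127.0.0.1', port), StatsHandler)
    # Serve in a daemon thread so the main program keeps running.
    threading.Thread(target=server.serve_forever, daemon=True).start()

Then curl http://127.0.0.1:8181/ takes the place of catting a file.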
You need to unlink the pipe after you issue the close. I think this is because there is a race condition where the pipe can be opened for reading again before cat finishes and it thus sees more data and reads it out, leading to multiples of "some interesting stats."
Basically you want something like:
import os

the_pipe = '/tmp/readstatshere'

while True:
    os.mkfifo(the_pipe)
    f = open(the_pipe, 'w')
    f.write('some interesting stats')
    f.close()
    os.unlink(the_pipe)
Update 1: call to mkfifo
Update 2: as noted in the comments, there is a race condition in this code as well with multiple consumers.
I have a process which chats a lot to stderr, and I want to log that stuff to a file.
foo 2> /tmp/foo.log
Actually I'm launching it with python subprocess.Popen, but it may as well be from the shell for the purposes of this question.
with open('/tmp/foo.log', 'w') as stderr:
    foo_proc = subprocess.Popen(['foo'], stderr=stderr)
The problem is after a few days my log file can be very large, like >500 MB. I am interested in all that stderr chat, but only the recent stuff. How can I limit the size of the logfile to, say, 1 MB? The file should be a bit like a circular buffer in that the most recent stuff will be written but the older stuff should fall out of the file, so that it never goes above a given size.
I'm not sure if there's an elegant Unixey way to do this already which I'm simply not aware of, with some sort of special file.
An alternative solution with log rotation would be sufficient for my needs as well, as long as I don't have to interrupt the running process.
You should be able to use the stdlib logging package to do this. Instead of connecting the subprocess' output directly to a file, you can do something like this:
import logging

logger = logging.getLogger('foo')

def stream_reader(stream):
    while True:
        line = stream.readline()
        if not line:  # EOF: the subprocess has closed its stderr
            break
        logger.debug('%s', line.strip())
This just logs every line received from the stream, and you can configure logging with a RotatingFileHandler which provides log file rotation. You then arrange to read this data and log it.
import subprocess
import threading

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
thread = threading.Thread(target=stream_reader, args=(foo_proc.stderr,))
thread.daemon = True  # optional
thread.start()
# do other stuff
thread.join()  # await thread termination (optional for daemons)
Of course you can call stream_reader(foo_proc.stderr) too, but I'm assuming you might have other work to do while the foo subprocess does its stuff.
Here's one way you could configure logging (code that should only be executed once):
import logging, logging.handlers
handler = logging.handlers.RotatingFileHandler('/tmp/foo.log', 'a', 100000, 10)
logging.getLogger().addHandler(handler)
logging.getLogger('foo').setLevel(logging.DEBUG)
This will keep the latest data in foo.log, plus up to 10 rotated backups of about 100K each (foo.log.1, foo.log.2 etc., with foo.log.1 the most recent backup). You could also pass in 1000000, 1 to give you just foo.log and foo.log.1, where rotation happens when the file would exceed 1000000 bytes in size.
The circular-buffer approach would be hard to implement, as you would constantly have to rewrite the whole file as soon as something falls out.
An approach with logrotate or something similar would be the way to go. In that case, you would simply do something like this:
import subprocess
import signal

def hupsignal(signum, frame):
    global logfile
    logfile.close()
    logfile = open('/tmp/foo.log', 'ab')

logfile = open('/tmp/foo.log', 'ab')
signal.signal(signal.SIGHUP, hupsignal)  # have logrotate send SIGHUP after rotating

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
for chunk in iter(lambda: foo_proc.stderr.read(8192), b''):
    # iterate until EOF occurs
    logfile.write(chunk)
    # or do you want to rotate yourself?
    # Then omit the signal stuff and do it here.
    # if logfile.tell() > MAX_FILE_SIZE:
    #     logfile.close()
    #     logfile = open('/tmp/foo.log', 'ab')
It is not a complete solution; think of it as pseudocode, as it is untested and I am not sure about the syntax in one place or another. It probably needs some modification to make it work, but you should get the idea.
It is also an example of how to make it work with logrotate. Of course, you can rotate your logfile yourself, if needed.
You may be able to use the properties of 'open file descriptions' (distinct from, but closely related to, 'open file descriptors'). In particular, the current write position is associated with the open file description, so two processes that share a single open file description can each adjust the write position.
So, in context, the original process could retain the file descriptor for standard error of the child process, and periodically, when the position reaches your 1 MiB size, reposition the pointer to the start of the file, thus achieving your required circular buffer effect.
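A rough sketch of that idea (it assumes the log file is not opened in append mode, since O_APPEND would override the repositioning; foo stands for the chatty program from the question):

import os
import subprocess
import time

LIMIT = 1024 * 1024   # 1 MiB

log = open('/tmp/foo.log', 'wb')   # note: not append mode
foo_proc = subprocess.Popen(['foo'], stderr=log)

while foo_proc.poll() is None:
    time.sleep(5)
    # Parent and child share one open file description, so this offset is
    # where the child will write next...
    pos = os.lseek(log.fileno(), 0, os.SEEK_CUR)
    if pos >= LIMIT:
        # ...and repositioning it here wraps the child's writes back to the
        # start of the file.
        os.lseek(log.fileno(), 0, os.SEEK_SET)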
The biggest problem is determining where the current messages are being written, so that you can read from the oldest material (just in front of the file position) to the newest material. It is unlikely that new lines overwriting the old will match exactly, so there'd be some debris. You might be able to follow each line from the child with a known character sequence (say 'XXXXXX'), and then have each write from the child reposition to overwrite the previous marker...but that definitely requires control over the program that's being run. If it is not under your control, or cannot be modified, that option vanishes.
An alternative would be to periodically truncate the file (maybe after copying it), and to have the child process write in append mode (because the file is opened in the parent in append mode). You could arrange to copy the material from the file to a spare file before truncating to preserve the previous 1 MiB of data. You might use up to 2 MiB that way, which is a lot better than 500 MiB and the sizes could be configured if you're actually short of space.
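A sketch of that truncate variant (again with foo standing in for the real program; note there is a small window between the copy and the truncate in which output can be lost):

import os
import shutil
import subprocess
import time

LIMIT = 1024 * 1024   # 1 MiB

log = open('/tmp/foo.log', 'ab')   # append mode, inherited by the child
foo_proc = subprocess.Popen(['foo'], stderr=log)

while foo_proc.poll() is None:
    time.sleep(5)
    if os.path.getsize('/tmp/foo.log') >= LIMIT:
        # Keep the previous chunk of history, then start the live file afresh;
        # O_APPEND makes the child's next write land at the new end of file.
        shutil.copyfile('/tmp/foo.log', '/tmp/foo.log.old')
        os.truncate('/tmp/foo.log', 0)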
Have fun!