How do Python pipes still work through spawned processes? - python

I'm trying to understand why the following code works:
import multiprocessing

def send_message(conn):
    # Send a message through the pipe
    conn.send("Hello, world!")

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    # Create a pipe
    parent_conn, child_conn = multiprocessing.Pipe()

    # Create a child process
    p = multiprocessing.Process(target=send_message, args=(child_conn,))
    p.start()

    # Wait for the child process to finish
    p.join()

    # Read the message from the pipe
    message = parent_conn.recv()
    print(message)
As I understand it, Python pipes are just regular OS pipes, which are file descriptors.
When a new process is created via spawn, we should lose all the file descriptors (in contrast to a regular fork).
In that case, how is it possible that the Python pipe is still "connected" to its parent process?

The documentation does not say that the child loses all file descriptors, only that "unnecessary file descriptors and handles from the parent process will not be inherited". To figure out exactly how this is achieved in CPython, we first need to see what happens when p.start() is called in the example code.
At some point during startup, the Process instance's underlying Popen helper is used; for 'spawn' this is the version provided by popen_spawn_posix. As part of the startup sequence, it gathers the data required to start the process, including which function to call and its arguments, and serializes that data with a dedicated pickler.
The Connection object (which Pipe is built on) defines a reduction hook that marks the relevant file descriptor for duplication. That hook is invoked during this serialization and points back to the helper Popen.duplicate_for_child in the 'spawn' implementation, ensuring that any connection objects passed in (in your case, args=(child_conn,)) have their file descriptors handed to the actual start function, spawnv_passfds, so that the child process has access to them.
I will note that I have glossed over various other details, but if you wish, you can always attach a debugger and trace through the startup sequence, which is what I did to derive this answer.
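If you want to see the effect for yourself, here is a small variation of the question's example (an illustration only) in which the child reports the file descriptor backing the Connection it received; that descriptor is one that was duplicated into the spawned process for it:

import multiprocessing

def report(conn):
    # The spawned child holds its own, duplicated descriptor for the pipe.
    conn.send(('child fd', conn.fileno()))

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=report, args=(child_conn,))
    p.start()
    print(parent_conn.recv())   # e.g. ('child fd', 4) -- a usable descriptor
    p.join()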

Related

Why stdin of parent process still accepts inputs after closing stdin file descriptor from a forked child process?

The fork(2) manual page on the Linux system I'm running says the following:
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
And the Python documentation mentions:
_exit() should normally only be used in the child process after a fork().
Of course, _exit won't call cleanup handlers. The problem is that, if you look at this code, for instance:
import os
import time

newpid = os.fork()
if newpid == 0:
    os.close(0)
else:
    time.sleep(.25)
    input()
The parent process still accepts input from stdin despite the fact that the child process closed stdin. That's good. Here's the code reversed:
newpid = os.fork()
if newpid == 0:
    input()
else:
    time.sleep(.25)
    os.close(0)
Now it's the opposite: this time it's the parent process that closes stdin, not the child. And this raises EOFError for the input() call in the child process.
It looks like when the child process modifies the parent's file descriptors, it does not affect the parent. That is, the child process seems to get its own, newer file descriptions.
Then why call _exit, as the Python docs suggest, to prevent invoking cleanup handlers, if operations performed by the child process do not affect the parent process? Let's take a look at the _exit(2) man page:
The function _exit() terminates the calling process "immediately". Any open file descriptors belonging to the process are closed; any children of the process are inherited by process 1, init, and the process's parent is sent a SIGCHLD signal.
The function _exit() is like exit(3), but does not call any functions registered with atexit(3) or on_exit(3). Open stdio(3) streams are not flushed. On the other hand, _exit() does close open file descriptors, and this may cause an unknown delay, waiting for pending output to finish.
The fork() manual page doesn't mention that the cleanup handlers of the child process are inherited from the parent. How does this affect the parent in any way? In other words, why not just let the child process clean up after itself?
I'm assuming you're running this from a shell within a terminal.
The shell launches the Python process in a new process group and uses tcsetpgrp() to set it as the foreground process group on the TTY.
Once the parent Python process terminates, the shell reclaims control of the terminal (it sets itself as the foreground process group). The shell does not know the forked child from Python is still running.
When a process which is not part of the foreground process group tries to read from the terminal, it normally receives a SIGTTIN signal. However, in this case, the process group has been orphaned because its leader has terminated, thus the child process gets an EIO error from read() on the TTY. Python treats this as EOFError.
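If you want to check the process-group explanation yourself, here is a rough sketch (assuming you run it from an interactive shell on a Unix system): after the parent exits and the shell reclaims the terminal, the child's process group no longer matches the terminal's foreground process group.

import os
import sys
import time

pid = os.fork()
if pid == 0:
    # Child: give the parent time to exit and the shell time to reclaim the TTY.
    time.sleep(0.5)
    print('child process group:', os.getpgrp())
    try:
        print('foreground process group:', os.tcgetpgrp(sys.stdin.fileno()))
    except OSError as exc:
        print('tcgetpgrp failed:', exc)
# Parent: fall through and exit immediately.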

Reading output from a child process using Python

The Context
I am using the subprocess module to start a process from python. I want to be able to access the output (stdout, stderr) as soon as it is written/buffered.
The solution must support Windows 7. I require a solution for Unix systems too but I suspect the Windows case is more difficult to solve.
The solution should support Python 2.6. I am currently restricted to Python 2.6 but solutions using later versions of Python are still appreciated.
The solution should not use third party libraries. Ideally I would love a solution using the standard library but I am open to suggestions.
The solution must work for just about any process. Assume there is no control over the process being executed.
The Child Process
For example, imagine I want to run a python file called counter.py via a subprocess. The contents of counter.py is as follows:
import sys

for index in range(10):

    # Write data to standard out.
    sys.stdout.write(str(index))

    # Push buffered data to disk.
    sys.stdout.flush()
The Parent Process
The parent process responsible for executing the counter.py example is as follows:
import subprocess

cmd = ['python', 'counter.py']

process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
The Issue
Using the counter.py example I can access the data before the process has completed. This is great! This is exactly what I want. However, removing the sys.stdout.flush() call prevents the data from being accessed at the time I want it. This is bad! This is exactly what I don't want. My understanding is that the flush() call forces the data to be written to disk and before the data is written to disk it exists only in a buffer. Remember I want to be able to run just about any process. I do not expect the process to perform this kind of flushing but I still expect the data to be available in real time (or close to it). Is there a way to achieve this?
A quick note about the parent process. You may notice I am using bufsize=1 for line buffering. I was hoping this would cause a flush to disk for every line but it doesn't seem to work that way. How does this argument work?
You will also notice I am using subprocess.PIPE. This is because it appears to be the only value which produces IO objects between the parent and child processes. I have come to this conclusion by looking at the Popen._get_handles method in the subprocess module (I'm referring to the Windows definition here). There are two important variables, c2pread and c2pwrite which are set based on the stdout value passed to the Popen constructor. For instance, if stdout is not set, the c2pread variable is not set. This is also the case when using file descriptors and file-like objects. I don't really know whether this is significant or not but my gut instinct tells me I would want both read and write IO objects for what I am trying to achieve - this is why I chose subprocess.PIPE. I would be very grateful if someone could explain this in more detail. Likewise, if there is a compelling reason to use something other than subprocess.PIPE I am all ears.
Method For Retrieving Data From The Child Process
import time
import subprocess
import threading
import Queue

class StreamReader(threading.Thread):
    """
    Threaded object used for reading process output stream (stdout, stderr).
    """

    def __init__(self, stream, queue, *args, **kwargs):
        super(StreamReader, self).__init__(*args, **kwargs)
        self._stream = stream
        self._queue = queue

        # Event used to terminate thread. This way we will have a chance to
        # tie up loose ends.
        self._stop = threading.Event()

    def stop(self):
        """
        Stop thread. Call this function to terminate the thread.
        """
        self._stop.set()

    def stopped(self):
        """
        Check whether the thread has been terminated.
        """
        return self._stop.isSet()

    def run(self):
        while True:
            # Flush buffered data (not sure this actually works?)
            self._stream.flush()

            # Read available data.
            for line in iter(self._stream.readline, b''):
                self._queue.put(line)

            # Breather.
            time.sleep(0.25)

            # Check whether thread has been terminated.
            if self.stopped():
                break
cmd = ['python', 'counter.py']

process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
)

stdout_queue = Queue.Queue()
stdout_reader = StreamReader(process.stdout, stdout_queue)
stdout_reader.daemon = True
stdout_reader.start()

# Read standard out of the child process whilst it is active.
while True:

    # Attempt to read available data.
    try:
        line = stdout_queue.get(timeout=0.1)
        print '%s' % line

    # If data was not read within time out period. Continue.
    except Queue.Empty:
        # No data currently available.
        pass

    # Check whether child process is still active.
    if process.poll() != None:

        # Process is no longer active.
        break

# Process is no longer active. Nothing more to read. Stop reader thread.
stdout_reader.stop()
Here I am performing the logic which reads standard out from the child process in a thread. This allows for the scenario in which the read blocks until data is available. Instead of waiting for some potentially long period of time, we check whether data is available to be read within a timeout period, and continue looping if it is not.
I have also tried another approach using a kind of non-blocking read. This approach uses the ctypes module to access Windows system calls. Please note that I don't fully understand what I am doing here - I have simply tried to make sense of some example code I have seen in other posts. In any case, the following snippet doesn't solve the buffering issue. My understanding is that it's just another way to combat a potentially long read time.
import os
import subprocess
import ctypes
import ctypes.wintypes
import msvcrt

cmd = ['python', 'counter.py']

process = subprocess.Popen(
    cmd,
    bufsize=1,
    stdout=subprocess.PIPE,
)

def read_output_non_blocking(stream):
    data = ''
    available_bytes = 0
    c_read = ctypes.c_ulong()
    c_available = ctypes.c_ulong()
    c_message = ctypes.c_ulong()

    fileno = stream.fileno()
    handle = msvcrt.get_osfhandle(fileno)

    # Read available data.
    buffer_ = None
    bytes_ = 0
    status = ctypes.windll.kernel32.PeekNamedPipe(
        handle,
        buffer_,
        bytes_,
        ctypes.byref(c_read),
        ctypes.byref(c_available),
        ctypes.byref(c_message),
    )
    if status:
        available_bytes = int(c_available.value)

    if available_bytes > 0:
        data = os.read(fileno, available_bytes)
        print data

    return data

while True:

    # Read standard out for child process.
    stdout = read_output_non_blocking(process.stdout)
    print stdout

    # Check whether child process is still active.
    if process.poll() != None:

        # Process is no longer active.
        break
Comments are much appreciated.
Cheers
At issue here is buffering by the child process. Your subprocess code already works as well as it could, but if you have a child process that buffers its output then there is nothing that subprocess pipes can do about this.
I cannot stress this enough: the buffering delays you see are the responsibility of the child process, and how it handles buffering has nothing to do with the subprocess module.
You already discovered this; this is why adding sys.stdout.flush() in the child process makes the data show up sooner; the child process uses buffered I/O (a memory cache to collect written data) before sending it down the sys.stdout pipe.[1]
Python automatically uses line-buffering when sys.stdout is connected to a terminal; the buffer flushes whenever a newline is written. When using pipes, sys.stdout is not connected to a terminal and a fixed-size buffer is used instead.
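As a small aside (not part of the original counter.py), a child script can check which of these cases it is in:

import sys

# When stdout is a pipe rather than a terminal, isatty() returns False and
# Python uses a fixed-size (block) buffer instead of line buffering.
if sys.stdout.isatty():
    sys.stdout.write('stdout is a terminal: line-buffered\n')
else:
    sys.stdout.write('stdout is a pipe or file: block-buffered\n')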
Now, the Python child process can be told to handle buffering differently; you can set an environment variable or use a command-line switch to alter how it uses buffering for sys.stdout (and sys.stderr and sys.stdin). From the Python command line documentation:
-u
Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode.
[...]
PYTHONUNBUFFERED
If this is set to a non-empty string it is equivalent to specifying the -u option.
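A minimal sketch of how either option might be applied to the counter.py example above; you only need one of the two, since they are equivalent:

import os
import subprocess

# Option 1: pass -u so the child interpreter runs with unbuffered stdout/stderr.
cmd = ['python', '-u', 'counter.py']

# Option 2: set PYTHONUNBUFFERED in the child's environment instead.
env = os.environ.copy()
env['PYTHONUNBUFFERED'] = '1'

process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    env=env,
)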
If you are dealing with child processes that are not Python processes and you experience buffering issues with those, you'll need to look at the documentation of those processes to see if they can be switched to use unbuffered I/O, or be switched to more desirable buffering strategies.
One thing you could try is to use the script -c command to provide a pseudo-terminal to a child process. This is a POSIX tool, however, and is probably not available on Windows.
[1] It should be noted that when flushing a pipe, no data is 'written to disk'; all data remains entirely in memory here. I/O buffers are just memory caches to get the best performance out of I/O by handling data in larger chunks. Only if you have a disk-based file object would fileobj.flush() cause it to push any buffers to the OS, which usually means that data is indeed written to disk.
expect has a command called 'unbuffer' that will disable buffering for any command:
http://expect.sourceforge.net/example/unbuffer.man.html

Python Multiprocessing Documentation Example

I'm trying to learn Python multiprocessing.
I'm working from http://docs.python.org/2/library/multiprocessing.html, specifically the example introduced with "To show the individual process IDs involved, here is an expanded example:"
from multiprocessing import Process
import os

def info(title):
    print title
    print 'module name:', __name__
    if hasattr(os, 'getppid'):  # only available on Unix
        print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
What exactly am I looking at? I see that f(name) is called after info('main line') is finished, but a synchronous call would be the default anyway. I see that the process which ran info('main line') is the parent (by PID) of the process running f(name), but I'm not sure what is 'multiprocessing' about that.
Also, with join() "Block the calling thread until the process whose join() method is called terminates". I'm not clear on what the calling thread would be. In this example what would join() be blocking?
How multiprocessing works, in a nutshell:
Process() spawns (fork or similar on Unix-like systems) a copy of the original program (on Windows, which lacks a real fork, this is tricky and requires the special care that the module documentation notes).
The copy communicates with the original to figure out that (a) it's a copy and (b) it should go off and invoke the target= function (see below).
At this point, the original and copy are now different and independent, and can run simultaneously.
Since these are independent processes, they now have independent Global Interpreter Locks (in CPython) so both can use up to 100% of a CPU on a multi-cpu box, as long as they don't contend for other lower-level (OS) resources. That's the "multiprocessing" part.
Of course, at some point you have to send data back and forth between these supposedly-independent processes, e.g., to send results from one (or many) worker process(es) back to a "main" process. (There is the occasional exception where everyone's completely independent, but it's rare ... plus there's the whole start-up sequence itself, kicked off by p.start().) So each created Process instance—p, in the above example—has a communications channel to its parent creator and vice versa (it's a symmetric connection). The multiprocessing module uses the pickle module to turn data into strings—the same strings you can stash in files with pickle.dump—and sends the data across the channel, "downwards" to workers to send arguments and such, and "upwards" from workers to send back results.
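As a rough illustration of that mechanism (this is just the pickle round-trip the module relies on, not its actual wire protocol):

import pickle

# The arguments destined for the worker are serialized to bytes...
payload = pickle.dumps(('bob',))

# ...sent across the parent/child channel, and rebuilt on the other side.
print(pickle.loads(payload))   # ('bob',)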
Eventually, once you're all done with getting results, the worker finishes (by returning from the target= function) and tells the parent it's done. To make sure everything gets closed and cleaned up, the parent should call p.join() to wait for the worker's "I'm done" message (actually an OS-level exit on Unix-ish systems).
The example is a little bit silly since the two printed messages take basically no time at all, so running them "at the same time" has no measurable gain. But suppose instead of just printing hello, f were to calculate the first 100,000 digits of π (3.14159...). You could then spawn another Process, p2 with a different target g that calculates the first 100,000 digits of e (2.71828...). These would run independently. The parent could then call p.join() and p2.join() to wait for both to complete (or spawn yet more workers to do more work and occupy more CPUs, or even go off and do its own work for a while first).
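A sketch of what that would look like; calc_pi_digits and calc_e_digits are hypothetical stand-ins for any two CPU-heavy functions:

from multiprocessing import Process

def calc_pi_digits(n):
    pass  # imagine an expensive computation here

def calc_e_digits(n):
    pass  # another, independent expensive computation

if __name__ == '__main__':
    p = Process(target=calc_pi_digits, args=(100000,))
    p2 = Process(target=calc_e_digits, args=(100000,))
    p.start()
    p2.start()
    # The parent is free to do its own work here while both children run.
    p.join()
    p2.join()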

python: what happens when an object is passed in multiprocessing.Process?

p = Process(target=f, args=(myObject,))
p.start()
p.join()
From experimentation, inside of function f(), I can access myObject fine and its members appear to be intact, even though presumably we're in a different process. Printing id(myObject) in the current function and in f() returns the same number.
Is Python secretly performing IPC when myObject is accessed inside of f()?
As Winston wrote: on Unix the process will be forked and the forked process is basically a full copy of the parent process (that's why the id is identical).
The actual behaviour depends on whether you are running Unix or Windows.
On *nix, fork() is used, which creates a complete copy of your process.
On Windows, I believe the object is pickled (see the pickle module) and sent over some IPC channel.
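A quick way to convince yourself that the child sees a copy rather than shared state (assuming the default fork start method on a Unix system):

from multiprocessing import Process

data = {'answer': 42}

def f(obj):
    # Under fork this prints the same id as in the parent, because the
    # child's address space is a copy of the parent's.
    print('child id:', id(obj))
    obj['answer'] = 0   # this mutation stays in the child's copy

if __name__ == '__main__':
    p = Process(target=f, args=(data,))
    p.start()
    p.join()
    print('parent value:', data['answer'])   # still 42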

Python program using os.pipe and os.fork() issue

I've recently needed to write a script that performs an os.fork() to split into two processes. The child process becomes a server process and passes data back to the parent process using a pipe created with os.pipe(). The child closes the 'r' end of the pipe and the parent closes the 'w' end of the pipe, as usual. I convert the returns from pipe() into file objects with os.fdopen.
The problem I'm having is this: The process successfully forks, and the child becomes a server. Everything works great and the child dutifully writes data to the open 'w' end of the pipe. Unfortunately the parent end of the pipe does two strange things:
A) It blocks on the read() operation on the 'r' end of the pipe.
B) It fails to read any data that was put on the pipe unless the 'w' end is entirely closed.
I immediately thought that buffering was the problem and added pipe.flush() calls, but these didn't help.
Can anyone shed some light on why the data doesn't appear until the writing end is fully closed? And is there a strategy to make the read() call non-blocking?
This is my first Python program that forked or used pipes, so forgive me if I've made a simple mistake.
Are you using read() without specifying a size, or treating the pipe as an iterator (for line in f)? If so, that's probably the source of your problem: read() is defined to read until the end of the file before returning, rather than just reading what is available. That means it will block until the child calls close().
In the example code linked to, this is OK - the parent is acting in a blocking manner and just using the child for isolation purposes. If you want to continue, then either use non-blocking I/O as in the code you posted (but be prepared to deal with half-complete data), or read in chunks (e.g. r.read(size) or r.readline()), which will block only until a specific size/line has been read. (You'll still need to call flush on the child.)
Treating the pipe as an iterator seems to use a further buffer as well, so "for line in r:" may not give you what you want if you need each line to be consumed immediately. It may be possible to disable this, but just specifying 0 for the buffer size in fdopen doesn't seem to be sufficient.
Here's some sample code that should work:
import os, sys, time

r, w = os.pipe()
r, w = os.fdopen(r, 'r', 0), os.fdopen(w, 'w', 0)

pid = os.fork()
if pid:  # Parent
    w.close()
    while 1:
        data = r.readline()
        if not data:
            break
        print "parent read: " + data.strip()
else:  # Child
    r.close()
    for i in range(10):
        print >>w, "line %s" % i
        w.flush()
        time.sleep(1)
Using
fcntl.fcntl(readPipe, fcntl.F_SETFL, os.O_NONBLOCK)
before invoking the read() solved both problems. The read() call is no longer blocking, and the data appears after just a flush() on the writing end.
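For completeness, a sketch of what the non-blocking variant tends to look like in practice, assuming r is the parent's read end from the sample above; with O_NONBLOCK set, a read when no data is available raises EAGAIN/EWOULDBLOCK, which has to be handled:

import errno
import fcntl
import os

fd = r.fileno()

# Preserve the existing flags and add O_NONBLOCK.
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    chunk = os.read(fd, 4096)
except OSError as e:
    if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
        chunk = b''   # nothing available right now; try again later
    else:
        raise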
I see you have solved the problem of blocking i/o and buffering.
A note if you decide to try a different approach: subprocess is the equivalent of / a replacement for the fork/exec idiom. It seems like that's not what you're doing: you have just a fork (not an exec) and are exchanging data between the two processes -- in this case the multiprocessing module (in Python 2.6+) would be a better fit.
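For example, here is a hedged sketch of the same parent/child exchange written with multiprocessing instead of os.fork and os.pipe:

from multiprocessing import Process, Pipe

def server(conn):
    # Child: write ten lines into its end of the pipe, then close it.
    for i in range(10):
        conn.send('line %s' % i)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=server, args=(child_conn,))
    p.start()
    child_conn.close()   # the parent keeps only its own end
    try:
        while True:
            print(parent_conn.recv())
    except EOFError:     # raised once the child's end is closed
        pass
    p.join()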
The "parent" vs. "child" part of fork in a Python application is silly. It's a legacy from 16-bit unix days. It's an affectation from a day when fork/exec and exec were Important Things to make the most of a tiny little processor.
Break your Python code into two separate parts: parent and child.
The parent part should use subprocess to run the child part.
A fork and exec may happen somewhere in there -- but you don't need to care.
