python: what happens when an object is passed to multiprocessing.Process?

p = Process(target=f, args=(myObject,))
p.start()
p.join()
From experimentation, inside of function f() I can access myObject fine and its members appear to be intact, even though presumably we're in a different process. Printing id(myObject) in the calling function and in f() returns the same number.
Is Python secretly performing IPC when myObject is accessed inside of f()?

As Winston wrote: on Unix the process will be forked and the forked process is basically a full copy of the parent process (that's why the id is identical).

The actual mechanism depends on whether you are running Unix or Windows.
On *nix, fork() is used, which creates a complete copy of your process.
On Windows, I believe the object is pickled (see the pickle module) and sent over some IPC channel.
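To make the consequence concrete, here is a small sketch (my own illustration; Box and f are made-up names): whichever mechanism is used, the child gets a copy, so mutations made inside f() are not visible to the parent afterwards.
from multiprocessing import Process

class Box:
    def __init__(self):
        self.value = 0

def f(box):
    box.value = 99                       # mutates the child's copy only
    print("child sees:", box.value, id(box))

if __name__ == '__main__':
    my_box = Box()
    p = Process(target=f, args=(my_box,))
    p.start()
    p.join()
    # The parent's object is unchanged, whether the child got it via fork or pickle.
    print("parent sees:", my_box.value, id(my_box))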

Related

How do Python pipes still work across spawned processes?

I'm trying to understand why the following code works:
import multiprocessing

def send_message(conn):
    # Send a message through the pipe
    conn.send("Hello, world!")

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    # Create a pipe
    parent_conn, child_conn = multiprocessing.Pipe()
    # Create a child process
    p = multiprocessing.Process(target=send_message, args=(child_conn,))
    p.start()
    # Wait for the child process to finish
    p.join()
    # Read the message from the pipe
    message = parent_conn.recv()
    print(message)
As I understand it, Python pipes are just regular OS pipes, which are file descriptors.
When a new process is created via spawn, we should lose all the file descriptors (contrary to a regular fork).
In that case, how is it possible that the Python pipe is still "connected" to its parent process?
The documentation does not suggest that it will lose all the file descriptors - only that "unnecessary file descriptors and handles from the parent process will not be inherited". To figure out how this is achieved exactly in CPython, first we need to see what exactly happens when p.start() is called in the example code.
At some point during startup, the Process instance's underlying Popen helper is used; for 'spawn' on POSIX that is the version provided by popen_spawn_posix. As part of the startup sequence, it gathers the data required to start the process, including which function to call and its arguments, and serializes that data with a specific pickler.
The Connection object (which Pipe is built on) defines a hook for that pickler which ensures the relevant file descriptor is marked for duplication. That hook is ultimately invoked during pickling and points back to the 'spawn' version of Popen.duplicate_for_child, ensuring that any connection objects passed (in your case, args=(child_conn,)) have their file descriptors passed through to the actual start function spawnv_passfds, so that the child process has access to them.
I will note that I have glossed over various other details, but if you wish to you can always attach a debugger and trace through the startup sequence, which is what I did to derive this answer.
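As a rough way to observe this without attaching a debugger, the sketch below (my own illustration, not from the original answer) prints the descriptor numbers on both sides; the child's number may differ from the parent's because the descriptor is duplicated into the child by the spawn startup machinery rather than inherited wholesale.
import multiprocessing
import os

def report(conn):
    # The Connection still works in the spawned child; its fd number may differ
    # from the parent's because it was duplicated rather than inherited as-is.
    conn.send((os.getpid(), conn.fileno()))

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    parent_conn, child_conn = multiprocessing.Pipe()
    print("parent pid/fd:", os.getpid(), child_conn.fileno())
    p = multiprocessing.Process(target=report, args=(child_conn,))
    p.start()
    print("child pid/fd: ", *parent_conn.recv())
    p.join()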

Python Multiprocessing Documentation Example

I'm trying to learn Python multiprocessing.
From http://docs.python.org/2/library/multiprocessing.html, I'm looking at the example under "To show the individual process IDs involved, here is an expanded example:"
from multiprocessing import Process
import os

def info(title):
    print title
    print 'module name:', __name__
    if hasattr(os, 'getppid'):  # only available on Unix
        print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
What exactly am I looking at? I see that f(name) is called after info('main line') is finished, but a synchronous call like that would be the default anyway. I also see that the process that ran info('main line') is the parent PID of the one running f(name), but I'm not sure what is 'multiprocessing' about that.
Also, with join() "Block the calling thread until the process whose join() method is called terminates". I'm not clear on what the calling thread would be. In this example what would join() be blocking?
How multiprocessing works, in a nutshell:
Process() spawns (fork or similar on Unix-like systems) a copy of the original program (on Windows, which lacks a real fork, this is tricky and requires the special care that the module documentation notes).
The copy communicates with the original to figure out that (a) it's a copy and (b) it should go off and invoke the target= function (see below).
At this point, the original and copy are now different and independent, and can run simultaneously.
Since these are independent processes, they now have independent Global Interpreter Locks (in CPython) so both can use up to 100% of a CPU on a multi-cpu box, as long as they don't contend for other lower-level (OS) resources. That's the "multiprocessing" part.
Of course, at some point you have to send data back and forth between these supposedly-independent processes, e.g., to send results from one (or many) worker process(es) back to a "main" process. (There is the occasional exception where everyone's completely independent, but it's rare ... plus there's the whole start-up sequence itself, kicked off by p.start().) So each created Process instance—p, in the above example—has a communications channel to its parent creator and vice versa (it's a symmetric connection). The multiprocessing module uses the pickle module to turn data into strings—the same strings you can stash in files with pickle.dump—and sends the data across the channel, "downwards" to workers to send arguments and such, and "upwards" from workers to send back results.
Eventually, once you're all done with getting results, the worker finishes (by returning from the target= function) and tells the parent it's done. To make sure everything gets closed and cleaned up, the parent should call p.join() to wait for the worker's "I'm done" message (actually an OS-level exit on Unix-ish systems).
The example is a little bit silly since the two printed messages take basically no time at all, so running them "at the same time" has no measurable gain. But suppose instead of just printing hello, f were to calculate the first 100,000 digits of π (3.14159...). You could then spawn another Process, p2 with a different target g that calculates the first 100,000 digits of e (2.71828...). These would run independently. The parent could then call p.join() and p2.join() to wait for both to complete (or spawn yet more workers to do more work and occupy more CPUs, or even go off and do its own work for a while first).
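A minimal sketch of that two-worker idea (Python 3; compute_pi and compute_e are my own illustrative stand-ins doing toy amounts of work, not 100,000-digit computations):
from multiprocessing import Process

def compute_pi():
    # Stand-in CPU-bound work: a short Leibniz-series approximation.
    total = 0.0
    for k in range(2000000):
        total += (-1) ** k / (2.0 * k + 1)
    print("pi is roughly", 4 * total)

def compute_e():
    # Stand-in CPU-bound work: a few terms of the series for e.
    term, total = 1.0, 1.0
    for k in range(1, 25):
        term /= k
        total += term
    print("e is roughly", total)

if __name__ == '__main__':
    p = Process(target=compute_pi)
    p2 = Process(target=compute_e)
    p.start()
    p2.start()   # both workers now run at the same time, on separate CPUs if available
    p.join()
    p2.join()    # the parent blocks here until each worker reports "I'm done"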

Need to find which program called the python script

I am using a build system (waf) which is a wrapper around Python. There are some programs (Perl scripts, exes, etc.) calling the Python build system. When I execute the build scripts from cmd.exe, I need to find out which program called them. My OS is Windows 7. I tried getting the parent PID in a Python module, and it returns "cmd" as the PPID and "python.exe" as the PID, so that approach did not help me find what I am looking for.
I believe I should be looking at some stack traces at the OS level, but I am not able to find how to do it. Please help me with the approach I should take or a possible code snippet. I just need to know the name of the script or program that called the system, for example caller.perl or callload.exe.
Thank you
Though I am not sure why it would be needed, this is a fun problem in itself, so here are a few tips. Once you have the parent PID, loop through the processes and get the name, e.g.
using WMI
import wmi

c = wmi.WMI()
for process in c.Win32_Process():
    if process.ProcessId == ppid:
        print process.ProcessId, process.Name
I think you can do the same thing using the win32 API, e.g.
processes = win32process.EnumProcesses()
for pid in processes:
    if pid == ppid:
        handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS,
                                      False, pid)
        exe = win32process.GetModuleFileNameEx(handle, 0)
This will work for simple cases where progA directly executes progB, but if there is a long chain of child processes in between, it may not be a good solution. The best way for the generic case would be for the calling program to announce its identity by passing it as an argument, e.g.
progB --calledfrom progA
Modify the Python script to add an argument to it, stating which file called it, then log that to a logger file. All scripts calling it will have to identify themselves to the Python script via the argument vector.
For example:
foo.pl calls yourfile.py as:
yourfile.py /path/to/foo.pl
yourfile.py:
import sys
def main(argv):
    print("called from: %s" % argv[1])  # argv[1] names the caller, e.g. /path/to/foo.pl
main(sys.argv)
I was able to use Process Explorer to see the chain of processes called, and I retrieved the name by just traversing the parents. Thanks to all who replied.
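For reference, the same "traverse the parents" idea can be done programmatically with the third-party psutil package (assumed installed; this is my own sketch, not what was used above):
import psutil

proc = psutil.Process()              # the current Python process
for ancestor in proc.parents():      # walk upwards: python.exe -> cmd.exe -> caller ...
    print(ancestor.pid, ancestor.name())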

How to tell process id within Python

I am working with a cluster system on Linux (www.mosix.org) that allows me to run jobs and have the system run them on different computers. Jobs are run like so:
mosrun ls &
This will naturally create the process and run it in the background, returning the process ID, like so:
[1] 29199
Later it will return. I am writing a Python infrastructure that would run jobs and control them. For that I want to run jobs using the mosrun program as above, and save the process ID of the spawned process (29199 in this case). This naturally cannot be done using os.system or commands.getoutput, as the printed ID is not what the process prints to output... Any clues?
Edit:
Since the Python script is only meant to launch the jobs, the jobs need to run longer than the Python shell. I guess that means the mosrun process cannot be the script's child process. Any suggestions?
Thanks
Use the subprocess module. Popen instances have a pid attribute.
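For example, a minimal sketch using the mosrun invocation from the question:
import subprocess

# Launch the job without waiting for it; Popen returns immediately.
proc = subprocess.Popen(["mosrun", "ls"])
print("spawned pid:", proc.pid)      # the PID of the mosrun process itself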
Looks like you want to ensure the child process is daemonized -- PEP 3143 documents that and points to a reference implementation, as well as to other implementations.
Once your process (still running Python code) is daemonized, be it by the means offered in PEP 3143 or others, you can os.execl (or other os.exec... function) your target code -- this runs said target code in exactly the same process which we just said is daemonized, and so it keeps being daemonized, as desired.
The last step cannot use subprocess because it needs to run in the same (daemonized) process, overlaying its executable code -- exactly what os.execl and friends are for.
The first step, before daemonization, might conceivably be done via subprocess, but that's somewhat inconvenient (you need to put the daemonize-then-os.exec code in a separate .py): most commonly you'd just want to os.fork and immediately daemonize the child process.
subprocess is quite convenient as a mostly-cross-platform way to run other processes, but it can't really replace Unix's good old "fork and exec" approach for advanced uses (such as daemonization, in this case) -- which is why it's a good thing that the Python standard library also lets you do the latter via those functions in module os!-)
Thanks all for the help. Here's what I did in the end, and it seems to work OK. The code uses python-daemon. Maybe something smarter should be done about transferring the process ID from the child to the parent, but that's the easier part.
import daemon
import os
import time
import warnings

def run_in_background(command, tmp_dir="/tmp"):
    # Decide on a temp file beforehand
    warnings.filterwarnings("ignore", "tempnam is a potential security")
    tmp_filename = os.tempnam(tmp_dir)
    # Duplicate the process
    pid = os.fork()
    # If we're the child, daemonize and run
    if pid == 0:
        with daemon.DaemonContext():
            child_id = os.getpid()
            file(tmp_filename, 'w').write(str(child_id))
            sp = command.split(' ')
            os.execl(*([sp[0]] + sp))
    else:
        # If we're the parent, poll for the new file
        n_iter = 0
        while True:
            if os.path.exists(tmp_filename):
                child_id = int(file(tmp_filename, 'r').read().strip())
                break
            if n_iter == 100:
                raise Exception("Cannot read process id from temp file %s" % tmp_filename)
            n_iter += 1
            time.sleep(0.1)
    return child_id
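Usage would then look something like this (my own sketch; "mosrun ls" is the command from the question):
pid = run_in_background("mosrun ls")
print("daemonized child pid: %d" % pid)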

Python program using os.pipe and os.fork() issue

I've recently needed to write a script that performs an os.fork() to split into two processes. The child process becomes a server process and passes data back to the parent process using a pipe created with os.pipe(). The child closes the 'r' end of the pipe and the parent closes the 'w' end of the pipe, as usual. I convert the returns from pipe() into file objects with os.fdopen.
The problem I'm having is this: The process successfully forks, and the child becomes a server. Everything works great and the child dutifully writes data to the open 'w' end of the pipe. Unfortunately the parent end of the pipe does two strange things:
A) It blocks on the read() operation on the 'r' end of the pipe.
B) It fails to read any data that was put on the pipe unless the 'w' end is entirely closed.
I immediately thought that buffering was the problem and added pipe.flush() calls, but these didn't help.
Can anyone shed some light on why the data doesn't appear until the writing end is fully closed? And is there a strategy to make the read() call non blocking?
This is my first Python program that forked or used pipes, so forgive me if I've made a simple mistake.
Are you using read() without specifying a size, or treating the pipe as an iterator (for line in f)? If so, that's probably the source of your problem - read() is defined to read until the end of the file before returning, rather than just reading what is currently available. That means it will block until the child calls close().
In the example code linked to, this is OK - the parent is acting in a blocking manner, and just using the child for isolation purposes. If you want to continue, then either use non-blocking IO as in the code you posted (but be prepared to deal with half-complete data), or read in chunks (e.g. r.read(size) or r.readline()), which will block only until a specific size / line has been read. (You'll still need to call flush on the child.)
It looks like treating the pipe as an iterator uses a further buffer as well, so "for line in r:" may not give you what you want if you need each line to be consumed immediately. It may be possible to disable this, but just specifying 0 for the buffer size in fdopen doesn't seem sufficient.
Here's some sample code that should work:
import os, sys, time

r, w = os.pipe()
r, w = os.fdopen(r, 'r', 0), os.fdopen(w, 'w', 0)

pid = os.fork()
if pid:  # Parent
    w.close()
    while 1:
        data = r.readline()
        if not data:
            break
        print "parent read: " + data.strip()
else:  # Child
    r.close()
    for i in range(10):
        print >>w, "line %s" % i
        w.flush()
        time.sleep(1)
Using
fcntl.fcntl(readPipe, fcntl.F_SETFL, os.O_NONBLOCK)
before invoking the read() solved both problems. The read() call is no longer blocking, and the data appears after just a flush() on the writing end.
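Spelled out a little more, here is my own sketch of that fix: the read end is switched to non-blocking mode, and the reader must then be prepared for the case where no data is available yet.
import errno
import fcntl
import os

def set_nonblocking(fileobj):
    # Add O_NONBLOCK to the existing flags on the pipe's read end.
    fd = fileobj.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

def try_read(fileobj, size=4096):
    # Returns b'' immediately instead of blocking when the pipe is empty.
    try:
        return os.read(fileobj.fileno(), size)
    except OSError as e:
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return b''
        raise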
I see you have solved the problem of blocking I/O and buffering.
A note if you decide to try a different approach: subprocess is the equivalent of / a replacement for the fork/exec idiom. It seems like that's not what you're doing: you have just a fork (not an exec) and are exchanging data between the two processes -- in this case the multiprocessing module (in Python 2.6+) would be a better fit.
The "parent" vs. "child" part of fork in a Python application is silly. It's a legacy from 16-bit Unix days. It's an affectation from a day when fork/exec and exec were Important Things to make the most of a tiny little processor.
Break your Python code into two separate parts: parent and child.
The parent part should use subprocess to run the child part.
A fork and exec may happen somewhere in there -- but you don't need to care.
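A minimal sketch of that structure (parent.py and child.py are hypothetical file names): the parent runs the child with subprocess and reads its output over a pipe, with no explicit fork or exec in sight.
# parent.py -- the parent half; child.py is the separate child script.
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "child.py"],    # run the child part with the same interpreter
    stdout=subprocess.PIPE,          # its stdout becomes our read end of a pipe
    universal_newlines=True,
)
for line in proc.stdout:
    print("parent read: " + line.strip())
proc.wait()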
