Understanding os.fork and Queue.Queue - python

I wanted to implement a simple python program using parallel execution. It's I/O bound, so I figured threads would be appropriate (as opposed to processes). After reading the documentation for Queue and fork, I thought something like the following might work.
q = Queue.Queue()
if os.fork():  # parent: fork() returns the child's pid here
    while True:
        print q.get()
else:  # child: fork() returns 0 here
    [q.put(x) for x in range(10)]
However, the get() call never returns. I thought it would return once the other thread executes a put() call. Using the threading module, things behave more like I expected:
q = Queue.Queue()

def consume(q):
    while True:
        print q.get()

worker = threading.Thread(target=consume, args=(q,))
worker.start()
[q.put(x) for x in range(10)]
I just don't understand why the fork approach doesn't do the same thing. What am I missing?

The POSIX fork system call creates a new process, rather than a new thread inside the same address space:
The fork() function shall create a new process. The new process (child
process) shall be an exact copy of the calling process (parent
process) except as detailed below: [...]
So the Queue is duplicated in your first example, rather than shared between the parent and child.
You can use multiprocessing.Queue instead or just use threads like in your second example :)
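For completeness, here is a minimal sketch of the first example rewritten around multiprocessing.Queue (the consume helper and the bound of 10 items are just carried over from the question):

import multiprocessing

def consume(q):
    # Runs in a child process; the multiprocessing.Queue is shared with the parent.
    for _ in range(10):
        print(q.get())

if __name__ == '__main__':
    q = multiprocessing.Queue()
    worker = multiprocessing.Process(target=consume, args=(q,))
    worker.start()
    for x in range(10):
        q.put(x)
    worker.join()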
By the way, using list comprehensions just for side effects isn't good practice for several reasons. You should use a for loop instead:
for x in range(10):
    q.put(x)

To share data between unrelated processes, you can use named pipes, via the os.open() function:
http://docs.python.org/2/library/os.html#os.open. You can simply name the pipe, e.g. named_pipe = 'my_pipe', create the FIFO with os.mkfifo(named_pipe), and then in the different Python programs call os.open(named_pipe, mode), where mode is os.O_WRONLY, os.O_RDONLY, and so on. Write into the FIFO through the returned file descriptor, and don't forget to close the pipe and catch exceptions.
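A rough sketch of the writer side along those lines (the path 'my_pipe' and the message are made up; a reader would mirror this with os.O_RDONLY and os.read()):

import os

named_pipe = 'my_pipe'

# Create the FIFO once; it shows up as a special file on disk.
if not os.path.exists(named_pipe):
    os.mkfifo(named_pipe)

# Opening for writing blocks until another process opens the pipe for reading.
fd = os.open(named_pipe, os.O_WRONLY)
try:
    os.write(fd, b'hello from the writer\n')
finally:
    os.close(fd)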

Fork creates a new process. The child and parent processes do not share the same Queue: that's why the elements put by the parent process cannot be retrieved by the child.

Related

multiprocessing - child process constantly sending back results and keeps running

Is it possible to have a few child processes running some calculations, sending results back to the main process (e.g. to update a PyQt UI) while they keep running, so that after a while they send back more data and update the UI again?
With multiprocessing.Queue, it seems like the data can only be sent back after the process has terminated.
So I wonder whether this case is possible or not.
I don't know what you mean by "With multiprocessing.Queue, it seems like the data can only be sent back after the process has terminated". This is exactly the use case that multiprocessing.Queue was designed for.
PyMOTW is a great resource for a whole load of Python modules, including multiprocessing. Check it out here: https://pymotw.com/2/multiprocessing/communication.html
A simple example of how to send ongoing messages from a child to the parent using multiprocessing and loops:
import multiprocessing

def child_process(q):
    for i in range(10):
        q.put(i)
    q.put("done")  # tell the parent process we've finished

def parent_process():
    q = multiprocessing.Queue()
    child = multiprocessing.Process(target=child_process, args=(q,))
    child.start()
    while True:
        value = q.get()
        if value == "done":  # no more values from child process
            break
        print value
    # do other stuff, child will continue to run in separate process
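As a small usage note, to actually run this (and to be safe on Windows or any spawn-based start method), the call would go under the usual entry-point guard:

if __name__ == '__main__':
    parent_process()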

Why won't Python Multiprocessing Workers die?

I'm using the Python multiprocessing functionality to map some function across some elements. Something along the lines of this:
def computeStuff(arguments, globalData, concurrent=True):
    pool = multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))
    results = pool.map(workerFunction, list(enumerate(arguments)))
    return results

def initWorker(globalData):
    workerFunction.globalData = globalData

def workerFunction((index, argument)):
    ...  # computation here
Generally I run tests in IPython using both CPython and PyPy. I have noticed that the spawned processes often don't get killed, so they start accumulating, each using a gig of RAM. This happens when hitting ctrl-k during a computation, which sends multiprocessing into a big frenzy of confusion. But even when letting the computation finish, those processes won't die in PyPy.
According to the documentation, when the pool gets garbage collected, it should call terminate() and kill all the processes. What's happening here? Do I have to explicitly call close()? If yes, is there some sort of context manager that properly manages closing the resources (i.e. processes)?
This is on Mac OS X Yosemite.
PyPy's garbage collection is lazy, so failing to call close means the Pool is cleaned "sometime", but that might not mean "anytime soon".
Once the Pool is properly closed, the workers exit when they run out of tasks. An easy way to ensure the Pool is closed in pre-3.3 Python is:
from contextlib import closing

def computeStuff(arguments, globalData, concurrent=True):
    with closing(multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))) as pool:
        return pool.map(workerFunction, enumerate(arguments))
Note: I also removed the explicit conversion to list (pointless, since map will iterate the enumerate iterator for you), and returned the results directly (no need to assign to a name only to return on the next line).
If you want to ensure immediate termination in the exception case (on pre-3.3 Python), you'd use a try/finally block, or write a simple context manager (which could be reused for other places where you use a Pool):
from contextlib import contextmanager

@contextmanager
def terminating(obj):
    try:
        yield obj
    finally:
        obj.terminate()

def computeStuff(arguments, globalData, concurrent=True):
    with terminating(multiprocessing.Pool(initializer=initWorker, initargs=(globalData,))) as pool:
        return pool.map(workerFunction, enumerate(arguments))
The terminating approach is superior in that it guarantees the processes exit immediately; in theory, if you're using threads elsewhere in your main program, the Pool workers might be forked with non-daemon threads, which would keep the processes alive even when the worker task thread exited; terminating hides this by killing the processes forcibly.
If your interpreter is Python 3.3 or higher, the terminating approach is built into Pool, so no special wrapper is needed for the with statement; with multiprocessing.Pool(initializer=initWorker, initargs=(globalData,)) as pool: works directly.
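For instance, a minimal sketch of that built-in form on Python 3.3+ (the square function here is just a stand-in for workerFunction):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    # On Python 3.3+, Pool.__exit__ calls terminate(), like the wrapper above.
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(square, range(10)))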

Can I safely use global Queues when using multiprocessing in python?

I have a large codebase to parallelise. I can avoid rewriting the method signatures of hundreds of functions by using a single global queue. I know it's messy; please don't tell me that using globals means I'm doing something wrong, because in this case it really is the easiest choice. The code below works, but I don't understand why. I declare a global multiprocessing.Queue() but never declare that it should be shared between processes (I don't pass it as a parameter to the worker). Does Python automatically place this queue in shared memory? Is it safe to do this on a larger scale?
Note: you can tell that the queue is shared between the processes: the worker processes are started with an empty queue and sit idle for one second, until the main process pushes some work onto it.
import multiprocessing
import time

outqueue = None

class WorkerProcess(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.exit = multiprocessing.Event()

    def doWork(self):
        global outqueue
        ob = outqueue.get()
        ob = ob + "!"
        print ob
        time.sleep(1)  # simulate more hard work
        outqueue.put(ob)

    def run(self):
        while not self.exit.is_set():
            self.doWork()

    def shutdown(self):
        self.exit.set()

if __name__ == '__main__':
    global outqueue
    outqueue = multiprocessing.Queue()

    procs = []
    for x in range(10):
        procs.append(WorkerProcess())
        procs[x].start()

    time.sleep(1)
    for x in range(20):
        outqueue.put(str(x))

    time.sleep(10)
    for p in procs:
        p.shutdown()

    for p in procs:
        p.join()

    try:
        while True:
            x = outqueue.get(False)
            print x
    except:
        print "done"
Assuming you're using Linux, the answer is in the way the OS creates a new process.
When a process spawns a new one in Linux, it actually forks the parent one. The result is a child process with all the properties of the parent one. Basically a clone.
In your example you are instantiating the Queue and then creating the new processes. Therefore the child processes have a copy of the same queue and are able to use it.
To see things break, try creating the processes first and only then creating the Queue object. You'll see that the children still have the global variable set to None, while the parent has a Queue.
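A quick illustration of that ordering point (simplified; the names sharedq and check are made up for the sketch):

import multiprocessing

sharedq = None

def check():
    # Prints None: the child was forked before sharedq was assigned in the parent.
    print(sharedq)

if __name__ == '__main__':
    p = multiprocessing.Process(target=check)
    p.start()                          # the fork happens here
    sharedq = multiprocessing.Queue()  # too late for the child's copy to see it
    p.join()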
It is safe, yet not recommended, to share a Queue as a global variable on Linux. On Windows, due to the different process creation approach, sharing a queue through a global variable won't work.
As mentioned in the programming guidelines
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
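Sketching what that recommendation looks like for the example above (the class name EchoWorker and the two-queue split are illustrative, not taken from the original code):

import multiprocessing

class EchoWorker(multiprocessing.Process):
    def __init__(self, inqueue, outqueue):
        multiprocessing.Process.__init__(self)
        self.inqueue = inqueue    # explicit references instead of a global
        self.outqueue = outqueue

    def run(self):
        ob = self.inqueue.get()
        self.outqueue.put(ob + "!")

if __name__ == '__main__':
    inq, outq = multiprocessing.Queue(), multiprocessing.Queue()
    worker = EchoWorker(inq, outq)  # the queues travel with the Process object
    worker.start()
    inq.put("hello")
    print(outq.get())               # prints "hello!"
    worker.join()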
For more info about Linux forking you can read its man page.

Does a queue make sense for multiprocessing even when you only expect a single result back?

I want to execute a function in another process and get a single result back (either true or false). I know the common way of getting results back from multiprocessing is using a queue, but does it make sense if I only expect a single result back?
p = Process(target=my_function_that_returns_boolean, args=(self, args))
p.start()
p.join()
# success = p.somehow_get_the_result_back
If you prefer to use Process rather than Pool, the documentation tells us that there are two ways to exchange objects between processes.
The first is Queue which you have already seen.
The second is Pipe, which the documentation provides an example for. I have slightly modified the example to show your case of returning a boolean.
from multiprocessing import Process, Pipe

def Foo(conn):
    # Do necessary processing here
    # ....
    # Instead of return True, we send True
    # return True
    conn.send(True)

parent_conn, child_conn = Pipe()
p = Process(target=Foo, args=(child_conn,))
p.start()
print parent_conn.recv()
p.join()
Queues are used to synchronize access to shared resources in a parallel environment. Common scenarios are when many workers consume tasks from a shared pool or when one execution line creates tasks and another consumes them.
If I understand correctly, that isn't your situation here, so there is no need to use queues. The only synchronization mechanism you need is one that tells one process that the other is done, and that is what join() provides.
Unless there is a real problem just keep things as simple as possible.
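Purely as an illustration of that minimal approach (this is not spelled out in the answer above): when the single result really is just success/failure, it can be encoded in the child's exit status and read back after join():

import sys
from multiprocessing import Process

def my_function_that_returns_boolean():
    ok = True                  # placeholder for the real check
    sys.exit(0 if ok else 1)   # the exit status carries the yes/no result

if __name__ == '__main__':
    p = Process(target=my_function_that_returns_boolean)
    p.start()
    p.join()
    success = (p.exitcode == 0)
    print(success)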
You can use a Pool, whose apply_async() returns an AsyncResult object:
from multiprocessing import Pool
pool = Pool(processes=1)
result = pool.apply_async(my_function_that_returns_boolean, [(self, args),])
success = result.get(timeout=None)

How to list Processes started by multiprocessing Pool?

While attempting to store a multiprocessing Process instance in the multiprocessing list variable poolList, I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the Process instances in a variable is to be able to terminate all or just some of them later (if, for example, a process freezes). If storing a Process in a variable is not an option, I would like to know how to get or list all the processes started by a multiprocessing Pool. That would be very similar to what the .current_process() method does, except .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as a result of mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead I would like to list all the processes currently running by multiprocessing. How do I achieve that?
import multiprocessing as mp

poolList = mp.Manager().list()

def myFunct(arg):
    print 'myFunct(): current process:', mp.current_process()
    try: poolList.append(mp.current_process())
    except Exception, e: print e

    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg] = i
    print 'myFunct(): completed', arg, poolDict

from multiprocessing import Pool
pool = Pool(processes=2)
myArgsList = ['arg1', 'arg2', 'arg3']
pool = Pool(processes=2)

pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, and it contains the instances of the processes.
However, it is not part of the documented interface and hence really should not be used.
BUT... it seems a little unlikely that it would change just like that anyway. I mean, would they stop keeping an internal list of processes in the pool? And not call it _pool?
It also annoys me that there isn't at least a get-processes method, or something like it.
And handling breakage due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool

# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)

    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process " + str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)

    # If you call _repopulate, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()
    for _curr_process in _my_pool._pool[:]:
        print("After repopulation " + str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to remove the process you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again. It is not needed to answer the question, but it gives a little behind-the-scenes insight.
Yes, you can get all active child processes and act on a process based on its name,
e.g.
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects are not shareable because they aren't pickle-able; that is, they aren't simple.
Oddly, the multiprocessing module doesn't have the equivalent of threading.enumerate(), that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs.
To kill a frozen worker, I suggest: 1) the worker receives "heartbeat" jobs in a queue every now and then, 2) if the parent notices that worker A hasn't responded to a heartbeat within a certain amount of time, then p.terminate(). Consider restating the problem in another SO question, as it's interesting.
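A rough sketch of that heartbeat idea (the queue names, the loop of 5 pings, and the 3-second threshold are all made up for illustration):

import time
from multiprocessing import Process, Queue

def worker(jobs, beats):
    while True:
        job = jobs.get()
        if job == 'heartbeat':
            beats.put(time.time())  # reply so the parent knows we are alive
        # real jobs would be handled here

if __name__ == '__main__':
    jobs, beats = Queue(), Queue()
    p = Process(target=worker, args=(jobs, beats))
    p.start()

    last_seen = time.time()
    for _ in range(5):
        jobs.put('heartbeat')
        time.sleep(1)
        while not beats.empty():
            last_seen = beats.get()
        if time.time() - last_seen > 3:  # silent for too long: assume frozen
            print('worker seems frozen, terminating')
            p.terminate()
            break

    p.terminate()  # stop the demo worker either way
    p.join()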
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time

def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append(msg)

def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')

    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list(  # pylint: disable=E1101
        [None] * 10
    )

    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()

    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()

    logger.info('running forever')
    try:
        manager.join()  # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')

if __name__ == '__main__':
    main()
