Sharing a result queue among several processes - python

The documentation for the multiprocessing module shows how to pass a queue to a process started with multiprocessing.Process. But how can I share a queue with asynchronous worker processes started with apply_async? I don't need dynamic joining or anything else, just a way for the workers to (repeatedly) report their results back to base.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    q = multiprocessing.Queue()
    workers = pool.apply_async(worker, (33, q))
This fails with:
RuntimeError: Queue objects should only be shared between processes through inheritance.
I understand what this means, and I understand the advice to inherit rather than require pickling/unpickling (and all the special Windows restrictions). But how do I pass the queue in a way that works? I can't find an example, and I've tried several alternatives that failed in various ways. Help please?

Try using multiprocessing.Manager to manage your queue and make it accessible to the different workers.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    m = multiprocessing.Manager()
    q = m.Queue()
    workers = pool.apply_async(worker, (33, q))
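For completeness (this part is not in the original answer), here is a minimal sketch of how the parent could then wait for the job and drain the managed queue to see what the workers reported:
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    m = multiprocessing.Manager()
    q = m.Queue()
    pool.apply_async(worker, (33, q))
    pool.close()
    pool.join()            # wait for the submitted job to finish
    while not q.empty():   # drain everything the workers put on the queue
        print(q.get())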

multiprocessing.Pool already has a shared result queue; there is no need to additionally involve a Manager.Queue. A Manager.Queue is a queue.Queue (multithreading queue) under the hood, living in a separate server process and exposed via proxies. This adds overhead compared to Pool's internal queue. Unlike with Pool's native result handling, results arriving in the Manager.Queue are also not guaranteed to be ordered.
The worker processes are not started with .apply_async(); that already happens when you instantiate Pool. What is started when you call pool.apply_async() is a new "job". Pool's worker processes run the multiprocessing.pool.worker function under the hood. This function takes care of processing new "tasks" transferred over Pool's internal Pool._inqueue and of sending results back to the parent over the Pool._outqueue. Your specified func will be executed within multiprocessing.pool.worker. func only has to return something, and the result will be automatically sent back to the parent.
.apply_async() immediately (asynchronously) returns an AsyncResult object (an alias for ApplyResult). You need to call .get() (which blocks) on that object to receive the actual result. Another option is to register a callback function, which gets fired as soon as the result becomes ready.
from multiprocessing import Pool

def busy_foo(i):
    """Dummy function simulating cpu-bound work."""
    for _ in range(int(10e6)):  # do stuff
        pass
    return i

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool._outqueue)  # DEMO
        results = [pool.apply_async(busy_foo, (i,)) for i in range(10)]
        # `.apply_async()` immediately returns AsyncResult (ApplyResult) object
        print(results[0])  # DEMO
        results = [res.get() for res in results]
        print(f'result: {results}')
Example Output:
<multiprocessing.queues.SimpleQueue object at 0x7fa124fd67f0>
<multiprocessing.pool.ApplyResult object at 0x7fa12586da20>
result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Note: Specifying the timeout parameter for .get() will not stop the actual processing of the task within the worker; it only unblocks the waiting parent by raising a multiprocessing.TimeoutError.
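As an illustration of the callback option mentioned above, here is a hedged sketch (the function names are invented, not from the original answer); the callback runs in the parent process as soon as a result becomes ready, so output arrives in completion order rather than submission order:
from multiprocessing import Pool

def busy_foo(i):
    return i * i

def on_result(value):
    # runs in the parent process when a task finishes
    print('got:', value)

if __name__ == '__main__':
    pool = Pool(4)
    for i in range(10):
        pool.apply_async(busy_foo, (i,), callback=on_result)
    pool.close()
    pool.join()  # by the time join() returns, every callback has fired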

Related

Python Multiprocessing - 'Queue' object has no attribute 'task_done' / 'join'

I am rewriting a threaded process into a multiprocessing queue to attempt to speed up a large calculation. I have gotten it 95% of the way there, but I can't figure out how to signal when the Queue is empty using multiprocessing.
My original code is something like this:
from Queue import Queue
from threading import Thread

num_fetch_threads = 4
enclosure_queue = Queue()

for i in range(num_fetch_threads):
    worker = Thread(target=run_experiment, args=(i, enclosure_queue))
    worker.setDaemon(True)
    worker.start()

for experiment in experiment_collection:
    enclosure_queue.put((experiment, otherVar))

enclosure_queue.join()
And the queue function like this:
def run_experiment(i, q):
    while True:
        # ... do stuff ...
        q.task_done()
My new code is something like this:
from multiprocessing import Process, Queue

num_fetch_threads = 4
enclosure_queue = Queue()

for i in range(num_fetch_threads):
    worker = Process(target=run_experiment, args=(i, enclosure_queue))
    worker.daemon = True
    worker.start()

for experiment in experiment_collection:
    enclosure_queue.put((experiment, otherVar))

worker.join()  ## I only put this here bc enclosure_queue.join() is not available
And the new queue function:
def run_experiment(i, q):
    while True:
        # ... do stuff ...
        ## not sure what should go here
I have been reading the docs and Google, but can't figure out what I am missing - I know that task_done / join are not part of the multiprocessing Queue class, but it's not clear what I am supposed to use.
"They differ in that Queue lacks the task_done() and join() methods
introduced into Python 2.5’s Queue.Queue class." Source
But without either of those, I'm not sure how the queue knows it is done, and how to continue on with the program.
Consider using a multiprocessing.Pool instead of managing workers manually. Pool handles dispatching tasks to workers, with convenient functions like map and apply, and supports .close and .join methods. Pool takes care of handling the queues between processes and processing the results. Here's how your code might look using multiprocessing.Pool:
from multiprocessing import Pool

def do_experiment(exp):
    # run the experiment `exp`, will be called by `p.map`
    return result

p = Pool()  # automatically scales to the number of CPUs available
results = p.map(do_experiment, experiment_collection)
p.close()
p.join()
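If each experiment also needs the otherVar that the original code put on the queue next to it, one possible adaptation (my sketch, not part of the answer above) is to bind it with functools.partial so that map still sees a one-argument function:
from functools import partial
from multiprocessing import Pool

def do_experiment(other_var, exp):
    # placeholder: run experiment `exp` using `other_var` and return something
    return (exp, other_var)

if __name__ == '__main__':
    experiment_collection = ['exp1', 'exp2', 'exp3']  # stand-in for the real collection
    otherVar = 42                                     # stand-in for the asker's otherVar
    p = Pool()
    results = p.map(partial(do_experiment, otherVar), experiment_collection)
    p.close()
    p.join()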

How to perform multiprocessing for a single function in Python?

I am reading the Multiprocessing topic for Python 3 and trying to incorporate the method into my script, however I receive the following error:
AttributeError: __exit__
I use Windows 7 with an 8-core i7 processor. I have a large shapefile which I want to process (with the mapping software QGIS), preferably using all 8 cores. Below is the code I have; I would greatly appreciate any help with this matter:
from multiprocessing import Process, Pool

def f():
    general.runalg("qgis:dissolve", Input, False, 'LAYER_ID', Output)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        result = pool.apply_async(f)
The context manager feature of multiprocessing.Pool was only added in Python 3.3:
New in version 3.3: Pool objects now support the context
management protocol – see Context Manager Types. __enter__() returns
the pool object, and __exit__() calls terminate().
The fact that __exit__ is not defined suggests you're using 3.2 or earlier. You'll need to manually call terminate on the Pool to get equivalent behavior:
if __name__ == '__main__':
    pool = Pool(processes=8)
    try:
        result = pool.apply_async(f)
    finally:
        pool.terminate()
That said, you probably don't want to use terminate (or the with statement, by extension) here. The __exit__ method of the Pool calls terminate, which will forcibly exit your workers, even if they're not done with their work. You probably want to actually wait for the worker to finish before you exit, which means you should call close() instead, and then use join to wait for all the workers to finish before exiting:
if __name__ == '__main__':
    pool = Pool(processes=8)
    result = pool.apply_async(f)
    pool.close()
    pool.join()
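One extra caveat (my addition, not part of the original answer): with apply_async, an exception raised inside f stays invisible until you ask for the result, so it can be worth calling result.get() before closing the pool:
if __name__ == '__main__':
    pool = Pool(processes=8)
    result = pool.apply_async(f)
    result.get()   # blocks until f finishes and re-raises any exception from it
    pool.close()
    pool.join()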

Does a queue make sense for multiprocessing even when you only expect a single result back?

I want to execute a function in another process and get a single result back (either true or false). I know the common way of getting results back from multiprocessing is using a queue, but does it make sense if I only expect a single result back?
p = Process(target=my_function_that_returns_boolean, args=(self, args))
p.start()
p.join()
# success = p.somehow_get_the_result_back
If you prefer to use Process rather than Pool, the documentation tells us that there are two ways to exchange objects between processes.
The first is Queue which you have already seen.
The second is Pipe, which the documentation provides an example for. I have slightly modified the example to show your case of returning a boolean.
from multiprocessing import Process, Pipe

def Foo(conn):
    # Do necessary processing here
    # ....
    # Instead of "return True", we send True over the pipe
    # return True
    conn.send(True)

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=Foo, args=(child_conn,))
    p.start()
    print parent_conn.recv()
    p.join()
Queues are used to synchronize access to shared resources in a parallel environment. Common scenarios are when many workers consume tasks from a shared pool or when one execution line creates tasks and another consumes them.
If I understand correctly, it isn't an issue here. So there is no need to use queues. The only synchronization mechanism you need is one that tells one process that the other is done. This is achieved by using join().
Unless there is a real problem just keep things as simple as possible.
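To illustrate that point, here is a hedged sketch (not from the answer above; it swaps in a multiprocessing.Value instead of a queue): join() is the only synchronization, and the shared value carries the single boolean back.
from multiprocessing import Process, Value

def my_function_that_returns_boolean(flag):
    flag.value = True  # report the single result

if __name__ == '__main__':
    success = Value('b', False)  # shared boolean ('b' = signed char)
    p = Process(target=my_function_that_returns_boolean, args=(success,))
    p.start()
    p.join()  # the only synchronization needed: wait until the child is done
    print(bool(success.value))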
You can use a Pool; its apply_async returns an AsyncResult object:
from multiprocessing import Pool

pool = Pool(processes=1)
result = pool.apply_async(my_function_that_returns_boolean, (self, args))
success = result.get(timeout=None)

How to list Processes started by multiprocessing Pool?

While attempting to store multiprocessing Process instances in the multiprocessing list variable `poolList`, I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the Process instances in a variable is to be able to terminate all or just some of them later (if, for example, a process freezes). If storing a Process in a variable is not an option, I would like to know how to get or list all the processes started by a multiprocessing Pool. That would be very similar to what the .current_process() method does, except that .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as a result of mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead, I would like to list all the processes currently running by multiprocessing. How do I achieve that?
import multiprocessing as mp

poolList = mp.Manager().list()

def myFunct(arg):
    print 'myFunct(): current process:', mp.current_process()
    try: poolList.append(mp.current_process())
    except Exception, e: print e
    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg] = i
    print 'myFunct(): completed', arg, poolDict
from multiprocessing import Pool
pool = Pool(processes=2)
myArgsList=['arg1','arg2','arg3']
pool=Pool(processes=2)
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, and it contains the instances of the processes.
However, it is not part of the documented interface and hence really should not be used.
BUT... it seems a little unlikely that it would change just like that anyway. I mean, should they stop having an internal list of processes in the pool? And not call it _pool?
And also, it annoys me that there isn't at least a get-processes method, or something.
And handling breakage due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool

# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)

    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process " + str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)

    # If you call _repopulate, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()

    for _curr_process in _my_pool._pool[:]:
        print("After repopulation " + str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to delete the processes you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again; it is not needed to answer the question, but gives a little behind-the-scenes insight.
Yes, you can get all active child processes and act based on the name of a process, e.g.:
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects are not shareable because they aren't pickle-able; that is, they aren't simple.
Oddly, the multiprocessing module doesn't have the equivalent of threading.enumerate() -- that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs.
To kill a frozen worker, I suggest: 1) the worker receives "heartbeat" jobs in a queue every now and then, 2) if the parent notices that worker A hasn't responded to a heartbeat in a certain amount of time, then p.terminate() (see the sketch below). Consider restating the problem in another SO question, as it's interesting.
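A minimal sketch of that heartbeat idea (my own illustration; the queue names and the 5-second timeout are made up, not from the original post):
import multiprocessing, queue, time

def worker(task_q, beat_q):
    while True:
        msg = task_q.get()
        if msg == 'heartbeat':
            beat_q.put(time.time())  # echo back so the parent knows we're alive
        elif msg == 'stop':
            return
        # else: ... do real work on msg ...

if __name__ == '__main__':
    task_q, beat_q = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(task_q, beat_q))
    p.start()
    task_q.put('heartbeat')
    try:
        beat_q.get(timeout=5)   # worker answered: it is still responsive
        task_q.put('stop')
        p.join()
    except queue.Empty:
        p.terminate()           # no heartbeat reply in time: assume frozen
        p.join()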
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
source
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time

def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append( msg )

def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')

    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list(  # pylint: disable=E1101
        [None] * 10
    )

    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()

    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()

    logger.info('running forever')
    try:
        manager.join()  # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')

if __name__ == '__main__':
    main()

The purpose of the pool in python multiprocessing

I'm having difficulty understanding the purpose of the pool in Python's multiprocessing module.
I know what this code is doing:
import multiprocessing

def worker():
    """worker function"""
    print 'Worker'
    return

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
So my question is, in what type of situation would a pool be used?
Pool objects are useful when you want to be able to submit more tasks to sub-processes, but you don't want to handle the organization of those tasks (i.e. how many processes should be spawned to handle them, which tasks go to which processes, etc.) and you care only about the result value, not any other kind of synchronisation. You don't want to control the sub-process computation; you simply want the result.
On the other hand, Process is used when you want to execute a specific action and you need control over the sub-process itself, not only over the result of its computation.
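For contrast, here is a hedged sketch (not from the original answer) of the same toy example expressed with a Pool: the pool decides how the five calls are distributed over its worker processes and hands the return values back.
import multiprocessing

def worker(i):
    """worker function"""
    return 'Worker %d' % i

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=5)
    print(pool.map(worker, range(5)))  # e.g. ['Worker 0', 'Worker 1', ..., 'Worker 4']
    pool.close()
    pool.join()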
