I am reading the multiprocessing documentation for Python 3 and trying to incorporate the approach into my script; however, I receive the following error:
AttributeError: __exit__
I use Windows 7 with an i7 8-core processor. I have a large shapefile which I want processed (with the mapping software QGIS), preferably using all 8 cores. Below is the code I have; I would greatly appreciate any help with this matter:
from multiprocessing import Process, Pool

def f():
    general.runalg("qgis:dissolve", Input, False, 'LAYER_ID', Output)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        result = pool.apply_async(f)
The context manager feature of multiprocessing.Pool was only added in Python 3.3:
New in version 3.3: Pool objects now support the context
management protocol – see Context Manager Types. __enter__() returns
the pool object, and __exit__() calls terminate().
The fact that __exit__ is not defined suggests you're using 3.2 or earlier. You'll need to manually call terminate on the Pool to get equivalent behavior:
if __name__ == '__main__':
    pool = Pool(processes=8)
    try:
        result = pool.apply_async(f)
    finally:
        pool.terminate()
That said, you probably don't want to use terminate (or the with statement, by extension) here. The __exit__ method of the Pool calls terminate, which will forcibly exit your workers, even if they're not done with their work. You probably want to actually wait for the worker to finish before you exit, which means you should call close() instead, and then use join to wait for all the workers to finish before exiting:
if __name__ == '__main__':
    pool = Pool(processes=8)
    result = pool.apply_async(f)
    pool.close()
    pool.join()
Related
I used the multiprocessing library to create multiple processes to handle a list of files (20+ files).
When I run the .py file, I set the pool size to 4, but cmd shows more than 10 processes, and most of them have been running for a long time. Because the files are large and take a long time to process, I'm not sure whether the processes are hanging or still executing.
So my questions are:
If they are still executing, how do I limit the number of processes to exactly 4?
If they are hanging, that means the child processes do not shut down after they finish. Can I make them shut down automatically once they are finished?
import sys
from multiprocessing import Pool

poolNum = int(sys.argv[1])

pool = Pool(poolNum)
pool.map(processFunc, fileList)
They won't shut down, not until the Pool is close-ed or terminate-ed (IIRC Pools, at least at present, have a reference cycle involved, so even when the last live reference to the Pool goes away, the Pool is not deterministically collected, even on CPython, which uses reference counting and normally has deterministic behavior).
Since you're using map, your work is definitely done when map returns, so the simplest solution is just to use a with statement for guaranteed termination:
import sys
from multiprocessing import Pool

def main():
    poolNum = int(sys.argv[1])
    with Pool(poolNum) as pool:  # Pool created
        pool.map(processFunc, fileList)
    # terminate has been called, all workers will be killed

# Adding main guard so this code is valid on Windows and anywhere else which
# doesn't use forking for whatever reason
if __name__ == '__main__':
    main()
As I commented in the code, I used a main function with the standard guard against being invoked on import, because Windows (and, on 3.8+, macOS, plus any OS where the script opts into the 'spawn' start method) simulates forking by reimporting the main module (but not naming it __main__); without the guard, you can end up with the child processes creating new processes automatically, which is problematic.
Side-note: If you are dispatching a bunch of tasks but not waiting on them immediately (so you don't want to terminate the pool anywhere near when you create it, but want to ensure the workers are cleaned up promptly), you can still use context management to help out. Just use contextlib.closing to close the pool once all the tasks are dispatched; you must dispatch all the tasks before the end of the with block, but you can retrieve the results later, and when all results are computed, the child processes will close. For example:
import sys
from contextlib import closing
from multiprocessing import Pool

def main():
    poolNum = int(sys.argv[1])
    with closing(Pool(poolNum)) as pool:  # Pool created
        results = pool.imap_unordered(processFunc, fileList)
    # close has been called, so no new work can be submitted,
    # and when all outstanding tasks complete, the workers will exit
    # immediately/cleanly
    for res in results:
        ...  # Can still retrieve results even after pool is closed

# Adding main guard so this code is valid on Windows and anywhere else which
# doesn't use forking for whatever reason
if __name__ == '__main__':
    main()
I am in the following setting: I have a method that takes an objective function f as input. As a subroutine of that method I want to evaluate f on a small set of points. Since f has high complexity, I considered doing that in parallel.
All online examples hang up even for trivial functions like squaring on sets with 5 points. They are using the multiprocessing library, and I don't know what I am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method (since it is part of a module, I guess instead of "__main__" I should use the module name?).
The code I have been using looks like this:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1, 2, 3, 4, 5]
num_cores = cpu_count()

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool(num_cores)
    y = list(pool.map(f, x))
    pool.join()
    print(y)
When I execute this code in Spyder it takes a very long time to finish.
So my main questions are: What am I doing wrong in this code? How can I encapsulate the __name__ statement when this code is part of a bigger method?
Is it even worth parallelizing this? (One function evaluation can take multiple minutes, and in serial this adds up to a total runtime of hours...)
According to the documentation:
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected, terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add:
from multiprocessing.pool import Pool
from multiprocessing import cpu_count

x = [1, 2, 3, 4, 5]

def f(x):
    return x**2

if __name__ == "__main__":
    pool = Pool()
    y = list(pool.map(f, x))
    pool.close()
    pool.join()
    print(y)
You can call Pool() without any argument and it will use cpu_count() by default:
If processes is None then the number returned by cpu_count() is used
About the if __name__ == "__main__" guard, read more information here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program, so that should be protected by __name__ == '__main__'.
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
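To illustrate (a minimal sketch, not from the original answer; process_item and the input list are placeholder names), chunksize batches the work sent to each worker, so less time is spent on inter-process communication:

import multiprocessing

def process_item(item):
    # stand-in for the real per-item work
    return item * item

if __name__ == '__main__':
    items = list(range(100000))
    with multiprocessing.Pool(4) as pool:
        # hand each worker batches of 1000 items instead of one item at a time
        results = pool.map(process_item, items, chunksize=1000)
    print(len(results))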
I have some code that does the same thing to several files in a Python 3 application, so it seems like a great candidate for multiprocessing. I'm trying to use Pool to assign work to some number of processes. I'd like the code to continue doing other things (mainly displaying things for the user) while these calculations are going on, so I'd like to use the map_async function of the multiprocessing.Pool class for this. I would expect that after calling this, the code will continue and the result will be handled by the callback I've specified, but this doesn't seem to be happening. The following code shows three ways I've tried calling map_async and the results I've seen:
import multiprocessing
import time

NUM_PROCS = 4

def func(arg_list):
    arg1 = arg_list[0]
    arg2 = arg_list[1]
    print('start func')
    print('arg1 = {0}'.format(arg1))
    print('arg2 = {0}'.format(arg2))
    time.sleep(1)
    result1 = arg1 * arg2
    print('end func')
    return result1

def callback(result):
    print('result is {0}'.format(result))

def error_handler(error1):
    print('error in call\n {0}'.format(error1))

def async1(arg_list1):
    # This is how my understanding of map_async suggests I should
    # call it. When I execute this, the target function func() is not called
    with multiprocessing.Pool(NUM_PROCS) as p1:
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)

def async2(arg_list1):
    with multiprocessing.Pool(NUM_PROCS) as p1:
        # If I call the wait function on the result for a small
        # amount of time, then the target function func() is called
        # and executes successfully in 2 processes, but the callback
        # function is never called so the results are not processed
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)
        r1.wait(0.1)

def async3(arg_list1):
    # If I explicitly call join on the pool, then the target function func()
    # successfully executes in 2 processes and the callback function is also
    # called, but by calling join the processing is not asynchronous any more
    # as join blocks the main process until the other processes are finished.
    with multiprocessing.Pool(NUM_PROCS) as p1:
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)
        p1.close()
        p1.join()

def main():
    arg_list1 = [(5, 3), (7, 4), (-8, 10), (4, 12)]
    async3(arg_list1)
    print('pool executed successfully')

if __name__ == '__main__':
    main()
When async1, async2 or async3 is called in main, the results are described in the comments for each function. Could anyone explain why the different calls behave the way they do? Ultimately I'd like to call map_async as done in async1, so I can do something else in the main process while the worker processes are busy. I have tested this code with Python 2.7 and 3.6, on an older RHEL 6 Linux box and a newer Ubuntu VM, with the same results.
This is happening because when you use the multiprocessing.Pool as a context manager, pool.terminate() is called when you leave the with block, which immediately exits all workers, without waiting for in-progress tasks to finish.
New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate().
IMO using terminate() as the __exit__ method of the context manager wasn't a great design choice, since it seems most people intuitively expect close() will be called, which will wait for in-progress tasks to complete before exiting. Unfortunately all you can do is refactor your code away from using a context manager, or refactor your code so that you guarantee you don't leave the with block until the Pool is done doing its work.
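For the second option, here is a minimal sketch (my own, not from the original answer) that stays inside the with block until the work is finished, so the terminate() triggered by __exit__ only fires afterwards; func and callback are simplified stand-ins for the question's versions:

import multiprocessing
import time

NUM_PROCS = 4

def func(arg_list):
    time.sleep(1)
    return arg_list[0] * arg_list[1]

def callback(result):
    print('result is {0}'.format(result))

def async_then_wait(arg_list1):
    with multiprocessing.Pool(NUM_PROCS) as p1:
        r1 = p1.map_async(func, arg_list1, callback=callback)
        # ... do other work in the main process here ...
        r1.wait()  # only leave the with block (and hence call terminate()) once done

if __name__ == '__main__':
    async_then_wait([(5, 3), (7, 4), (-8, 10), (4, 12)])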
While attempting to store a multiprocessing Process instance in the multiprocessing list variable poolList, I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the PROCESS instances in a variable is to be able to terminate all or just some of them later (if, for example, a PROCESS freezes). If storing a PROCESS in a variable is not an option, I would like to know how to get or list all the PROCESSES started by a multiprocessing POOL. That would be very similar to what the .current_process() method does, except .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as the result of mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead I would like to list all the processes currently run by multiprocessing. How do I achieve that?
import multiprocessing as mp

poolList = mp.Manager().list()

def myFunct(arg):
    print 'myFunct(): current process:', mp.current_process()
    try: poolList.append(mp.current_process())
    except Exception, e: print e

    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg] = i
    print 'myFunct(): completed', arg, poolDict

from multiprocessing import Pool
pool = Pool(processes=2)

myArgsList = ['arg1', 'arg2', 'arg3']

pool = Pool(processes=2)
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, and it contains the instances of the processes.
However, it is not part of the documented interface and hence, really should not be used.
BUT...it seems a little bit unlikely that it would change just like that anyway. I mean, should they stop having an internal list of processes in the pool? And not call that _pool?
Also, it annoys me that there isn't at least a get-processes method, or something.
And handling it breaking due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool

# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)

    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process " + str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)

    # If you call _repopulate, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()
    for _curr_process in _my_pool._pool[:]:
        print("After repopulation " + str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to delete the process you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again; it is not needed to answer the question, but gives a little bit of behind-the-scenes insight.
Yes, you can get all active processes and perform an action based on the name of a process, e.g.:
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects aren't shareable because they aren't pickle-able; that is, they aren't simple.
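In other words, keep the Manager itself alive for as long as the list proxy is needed; a minimal sketch (my own, not from the original answer):

import multiprocessing as mp

manager = mp.Manager()     # keep a reference so the manager process stays alive
poolList = manager.list()  # proxies handed to workers stay valid while it lives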
Oddly the multiprocessing module doesn't have the equivalent of threading.enumerate() -- that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs
To kill a frozen worker, I suggest: 1) worker receives "heartbeat" jobs in a queue every now and then, 2) if parent notices worker A hasn't responded to a heartbeat in a certain amount of time, then p.terminate(). Consider restating the problem in another SO question, as it's interesting.
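As a rough, hypothetical sketch of that heartbeat idea (my own, not from the original answer; here the workers push heartbeats to the parent rather than receiving heartbeat jobs, and all names and timings are made up), the parent keeps the Process objects in a plain list and terminates any worker that goes quiet for too long:

import multiprocessing, queue, time

def worker(wid, heartbeat_q):
    """Hypothetical worker: do chunks of work and report a heartbeat."""
    while True:
        time.sleep(1)          # stand-in for a chunk of real work
        heartbeat_q.put(wid)   # tell the parent this worker is still alive

def main():
    heartbeat_q = multiprocessing.Queue()
    procs = []                 # keep the Process objects in a plain list
    for wid in range(3):
        p = multiprocessing.Process(target=worker, args=(wid, heartbeat_q))
        p.start()
        procs.append(p)

    last_seen = {wid: time.time() for wid in range(3)}
    deadline = 5.0             # seconds of silence before a worker is deemed frozen

    for _ in range(30):        # monitoring loop, bounded just for this example
        try:
            wid = heartbeat_q.get(timeout=1)
            last_seen[wid] = time.time()
        except queue.Empty:
            pass
        for wid, p in enumerate(procs):
            if p.is_alive() and time.time() - last_seen[wid] > deadline:
                print('worker %d looks frozen, terminating it' % wid)
                p.terminate()

    for p in procs:            # clean up at the end of the example
        p.terminate()
        p.join()

if __name__ == '__main__':
    main()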
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time

def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append(msg)

def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')

    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list(  # pylint: disable=E1101
        [None] * 10
    )

    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()

    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()

    logger.info('running forever')
    try:
        manager.join()  # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')

if __name__ == '__main__':
    main()
The documentation for the multiprocessing module shows how to pass a queue to a process started with multiprocessing.Process. But how can I share a queue with asynchronous worker processes started with apply_async? I don't need dynamic joining or anything else, just a way for the workers to (repeatedly) report their results back to base.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    q = multiprocessing.Queue()
    workers = pool.apply_async(worker, (33, q))
This fails with:
RuntimeError: Queue objects should only be shared between processes through inheritance.
I understand what this means, and I understand the advice to inherit rather than require pickling/unpickling (and all the special Windows restrictions). But how do I pass the queue in a way that works? I can't find an example, and I've tried several alternatives that failed in various ways. Help please?
Try using multiprocessing.Manager to manage your queue and to also make it accessible to different workers.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    m = multiprocessing.Manager()
    q = m.Queue()
    workers = pool.apply_async(worker, (33, q))
multiprocessing.Pool already has a shared result queue; there is no need to additionally involve a Manager.Queue. Manager.Queue is a queue.Queue (multithreading queue) under the hood, located on a separate server process and exposed via proxies. This adds extra overhead compared to Pool's internal queue. Unlike with Pool's native result handling, results in the Manager.Queue are also not guaranteed to be ordered.
The worker processes are not started by .apply_async(); that already happens when you instantiate Pool. What is started when you call pool.apply_async() is a new "job". Pool's worker processes run the multiprocessing.pool.worker function under the hood. This function takes care of processing new "tasks" transferred over Pool's internal Pool._inqueue and of sending results back to the parent over the Pool._outqueue. Your specified func will be executed within multiprocessing.pool.worker. func only has to return something, and the result will be automatically sent back to the parent.
.apply_async() immediately (asynchronously) returns an AsyncResult object (an alias for ApplyResult). You need to call .get() (which blocks) on that object to receive the actual result. Another option would be to register a callback function, which gets fired as soon as the result becomes ready.
from multiprocessing import Pool

def busy_foo(i):
    """Dummy function simulating cpu-bound work."""
    for _ in range(int(10e6)):  # do stuff
        pass
    return i

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool._outqueue)  # DEMO
        results = [pool.apply_async(busy_foo, (i,)) for i in range(10)]
        # `.apply_async()` immediately returns AsyncResult (ApplyResult) object
        print(results[0])  # DEMO
        results = [res.get() for res in results]
        print(f'result: {results}')
Example Output:
<multiprocessing.queues.SimpleQueue object at 0x7fa124fd67f0>
<multiprocessing.pool.ApplyResult object at 0x7fa12586da20>
result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Note: Specifying the timeout-parameter for .get() will not stop the actual processing of the task within the worker, it only unblocks the waiting parent by raising a multiprocessing.TimeoutError.
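As a small hedged sketch (my own, not from the original answer) of the callback option and the .get() timeout behavior described above, with busy_foo reused from the example:

from multiprocessing import Pool, TimeoutError

def busy_foo(i):
    """Dummy function simulating cpu-bound work."""
    for _ in range(int(10e6)):
        pass
    return i

def on_done(result):
    # fired in the parent as soon as the result is ready
    print('callback got:', result)

if __name__ == '__main__':
    with Pool(4) as pool:
        async_res = pool.apply_async(busy_foo, (7,), callback=on_done)
        try:
            # most likely raises TimeoutError after 0.001s; the task keeps running
            print(async_res.get(timeout=0.001))
        except TimeoutError:
            print('not ready yet, still computing')
        print(async_res.get())  # block until the result really is ready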