unexpected behaviour of multiprocessing Pool map_async - python

I have some code that does the same thing to several files in a Python 3 application, so it seems like a great candidate for multiprocessing. I'm trying to use Pool to assign work to some number of processes. I'd like the code to continue to do other things (mainly displaying things for the user) while these calculations are going on, so I'd like to use the map_async function of the multiprocessing.Pool class for this. I would expect that after calling this, the code will continue and the result will be handled by the callback I've specified, but this doesn't seem to be happening. The following code shows three ways I've tried calling map_async and the results I've seen:
import multiprocessing
import time

NUM_PROCS = 4

def func(arg_list):
    arg1 = arg_list[0]
    arg2 = arg_list[1]
    print('start func')
    print('arg1 = {0}'.format(arg1))
    print('arg2 = {0}'.format(arg2))
    time.sleep(1)
    result1 = arg1 * arg2
    print('end func')
    return result1

def callback(result):
    print('result is {0}'.format(result))

def error_handler(error1):
    print('error in call\n {0}'.format(error1))

def async1(arg_list1):
    # This is how my understanding of map_async suggests I should
    # call it. When I execute this, the target function func() is not called
    with multiprocessing.Pool(NUM_PROCS) as p1:
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)

def async2(arg_list1):
    with multiprocessing.Pool(NUM_PROCS) as p1:
        # If I call the wait function on the result for a small
        # amount of time, then the target function func() is called
        # and executes successfully in 2 processes, but the callback
        # function is never called so the results are not processed
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)
        r1.wait(0.1)

def async3(arg_list1):
    # If I explicitly call join on the pool, then the target function func()
    # successfully executes in 2 processes and the callback function is also
    # called, but by calling join the processing is not asynchronous any more
    # as join blocks the main process until the other processes are finished.
    with multiprocessing.Pool(NUM_PROCS) as p1:
        r1 = p1.map_async(func,
                          arg_list1,
                          callback=callback,
                          error_callback=error_handler)
        p1.close()
        p1.join()

def main():
    arg_list1 = [(5, 3), (7, 4), (-8, 10), (4, 12)]
    async3(arg_list1)
    print('pool executed successfully')

if __name__ == '__main__':
    main()
When async1, async2 or async3 is called in main, the results are as described in the comments for each function. Could anyone explain why the different calls behave the way they do? Ultimately I'd like to call map_async as done in async1, so I can do something else in the main process while the worker processes are busy. I have tested this code with Python 2.7 and 3.6, on an older RH6 Linux box and a newer Ubuntu VM, with the same results.

This is happening because when you use the multiprocessing.Pool as a context manager, pool.terminate() is called when you leave the with block, which immediately exits all workers, without waiting for in-progress tasks to finish.
New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate().
IMO using terminate() as the __exit__ method of the context manager wasn't a great design choice, since it seems most people intuitively expect close() will be called, which will wait for in-progress tasks to complete before exiting. Unfortunately all you can do is refactor your code away from using a context manager, or refactor your code so that you guarantee you don't leave the with block until the Pool is done doing its work.
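As a rough illustration of the first option (no context manager), here is a minimal sketch. The names multiply and run_async are made up for this example, and the sleep inside the loop is only a placeholder for whatever the main process really wants to do in the meantime:

```python
import multiprocessing
import time


def multiply(pair):
    """Worker: multiply the two numbers in a pair after a short delay."""
    a, b = pair
    time.sleep(0.2)
    return a * b


def run_async(pairs, num_procs=2):
    """Start map_async, keep working in the main process, then clean up."""
    collected = []
    pool = multiprocessing.Pool(num_procs)
    try:
        # The callback receives the complete result list once, in the parent.
        async_result = pool.map_async(multiply, pairs, callback=collected.extend)
        pool.close()  # no more tasks will be submitted; queued work still runs
        while not async_result.ready():
            # Placeholder for the real foreground work (e.g. updating a display).
            time.sleep(0.05)
        pool.join()  # all tasks are done, so this only waits for workers to exit
    except BaseException:
        pool.terminate()  # on an error or Ctrl-C, do not leave workers behind
        raise
    return collected


if __name__ == '__main__':
    print(run_async([(5, 3), (7, 4), (-8, 10), (4, 12)]))  # [15, 28, -80, 48]
```

The point is that the pool outlives the map_async call: close() lets the queued work finish, and terminate() is reserved for the error path.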

Related

What is the true nature of "blocking" in the context of pool.map() and pool.join()?

I am trying to sketch a picture for myself of how to appropriately use Pool object.
I have a slightly more complex task, but here's the gist:
import os
from multiprocessing import Pool
import numpy as np

def func1(x):
    return x*2

def func2(x):
    return np.sqrt(x)

with Pool(os.cpu_count()) as p:
    x = p.map(func1, range(1000))
    x = p.map(func2, x)
Then comes some documentation of pool.map and pool.join:
map(func, iterable[, chunksize]):
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though, for multiple iterables see starmap()).
It blocks until the result is ready.
And
join()
Wait for the worker processes to exit. One must call close() or
terminate() before using join().
I don't have a strong understanding of what "block" means, but it seems that if I call x = p.map(func1, arg) followed by y = p.map(func2, x), the pool will be strictly assigned to the first task until it is complete, and only then will it be allowed to work on the next task.
Question 1: Is that understanding correct?
If my understanding is correct, it seems like I don't need to use p.join() as it seems to do the same thing (blocks the pool from being used until it's finished with its current job).
Question 2: Do I need to use p.join() for a task like this one?
Finally, I see pool.close(), which "Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit". How can more tasks be submitted without me telling it?
Question 3: Do I need to do anything after all the work is done, like call p.close()?
You can create Processes and Pools directly (and start and stop them manually), or you can use the with construct (as you did) so that this is handled for you automatically.
This should give you the same result as your code:
p = Pool(os.cpu_count())
x = p.map(func1, range(1000))
x = p.map(func2, x)
p.close()
About Pool.join()
Pool.join() waits for the worker processes to exit.
Pool.map() blocks until the result is ready. At the same time, the processes in the pool do not terminate but are ready to accept new tasks.
Processes in the pool do not terminate after completing task, so one must call close() or terminate() before using join():
pool = Pool(processes=5)
results = pool.map(func, iterable)
pool.close()
pool.join()
When you use the with context manager, you do not need to call close or join, because leaving the with block stops the pool (it calls terminate()):
with Pool(processes=5) as pool:
    results = pool.map(func, iterable)
In contrast, the Process class terminates after the function is completed:
p = Process(target=f, args=('arg_1',))
p.start()
p.join()
All computer programs are written by people, but the interaction between parts of the system can be quite complex, for example, tasks can be submitted when some events are triggered, requests are received, etc.
See point 1
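To make "blocking" concrete, here is a small sketch of two consecutive map() calls on one pool. The function names double and pipeline are invented for this example, and math.sqrt stands in for the NumPy call in the question:

```python
import math
from multiprocessing import Pool


def double(x):
    """Worker for the first stage."""
    return x * 2


def pipeline(values, processes=2):
    """Run two map() stages on the same pool, one after the other."""
    pool = Pool(processes)
    try:
        # map() does not return until every result of this stage is back,
        # so the second stage cannot start before the first has finished.
        doubled = pool.map(double, values)
        # The same worker processes are still alive and are reused here.
        roots = pool.map(math.sqrt, doubled)
    finally:
        pool.close()  # no further tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return roots


if __name__ == '__main__':
    print(pipeline([2, 8, 18]))  # [2.0, 4.0, 6.0]
```

Because each map() call already waits for its own results, join() is not needed between the stages; it only matters at the end, when the workers themselves should go away.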

How to perform multiprocessing for a single function in Python?

I am reading the Multiprocessing topic for Python 3 and trying to incorporate the method into my script, however I receive the following error:
AttributeError: __exit__
I use Windows 7 with an 8-core i7 processor. I have a large shapefile which I want processed (with the mapping software QGIS), preferably using all 8 cores. Below is the code I have; I would greatly appreciate any help with this matter:
from multiprocessing import Process, Pool

def f():
    general.runalg("qgis:dissolve", Input, False, 'LAYER_ID', Output)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        result = pool.apply_async(f)
The context manager feature of multiprocessing.Pool was only added into Python 3.3:
New in version 3.3: Pool objects now support the context
management protocol – see Context Manager Types. __enter__() returns
the pool object, and __exit__() calls terminate().
The fact that __exit__ is not defined suggests you're using 3.2 or earlier. You'll need to manually call terminate on the Pool to get equivalent behavior:
if __name__ == '__main__':
    pool = Pool(processes=8)
    try:
        result = pool.apply_async(f)
    finally:
        pool.terminate()
That said, you probably don't want to use terminate (or the with statement, by extension) here. The __exit__ method of the Pool calls terminate, which will forcibly exit your workers, even if they're not done with their work. You probably want to actually wait for the worker to finish before you exit, which means you should call close() instead, and then use join to wait for all the workers to finish before exiting:
if __name__ == '__main__':
    pool = Pool(processes=8)
    result = pool.apply_async(f)
    pool.close()
    pool.join()

multiprocessing.Pool: calling helper functions when using apply_async's callback option

How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000-file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to characterize each file. This is done on a 16-CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
import os

def dirwalker(directory):
    ahlala = []

    # X() reads files and grabs lines, calls helper function to calculate
    # info, and returns stuff to the callback function
    def X(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # Y() reads other types of files and does the same thing
    def Y(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # results() is the callback function
    def results(r):
        ahlala.extend(r) # or .append, haven't yet decided

    # helper function
    def Z(arr):
        return fileinfo # to X() or Y()!

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if (filetype(f) == filetypeX):
                pool.apply_async(X, args=(f,), callback=results)
            elif (filetype(f) == filetypeY):
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close(); pool.join()
    return ahlala
Note that the code works if I put all of Z(), the helper function, into X(), Y(), or results(), but isn't that repetitive, or possibly slower than it needs to be? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async() finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: Are daemon processes why nothing shows up? I am also very confused about how to queue things, and whether this is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticeable cost in time?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls the result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so it's important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, and then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
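If you want to see this for yourself, the following sketch records where the callback runs. The names report_pid, where_does_callback_run and on_result are invented for the demonstration:

```python
import multiprocessing
import os
import threading


def report_pid(_):
    """Runs in a worker process and returns that worker's process id."""
    return os.getpid()


def where_does_callback_run():
    """Return (worker_pid, callback_pid, callback_thread_name)."""
    seen = {}

    def on_result(worker_pid):
        # Called by the pool's result-handling thread, in the parent process.
        seen['worker_pid'] = worker_pid
        seen['callback_pid'] = os.getpid()
        seen['callback_thread'] = threading.current_thread().name

    pool = multiprocessing.Pool(1)
    try:
        pool.apply_async(report_pid, (None,), callback=on_result)
    finally:
        pool.close()
        pool.join()  # by the time join() returns, the callback has run
    return seen['worker_pid'], seen['callback_pid'], seen['callback_thread']


if __name__ == '__main__':
    worker_pid, callback_pid, thread_name = where_does_callback_run()
    print('parent pid:  ', os.getpid())
    print('worker pid:  ', worker_pid)
    print('callback pid:', callback_pid, 'in thread', thread_name)
```

The callback pid should match the parent pid, while the thread name should differ from MainThread.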
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
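A minimal sketch of how keeping the AsyncResult objects makes failures visible. The worker parse_positive and the helper collect are hypothetical names used only for this example:

```python
import multiprocessing


def parse_positive(text):
    """Hypothetical worker: fails on anything that is not a positive integer."""
    value = int(text)  # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError('not positive: {0}'.format(value))
    return value


def collect(inputs):
    """Submit every input and return (results, error messages)."""
    results, errors = [], []
    pool = multiprocessing.Pool(2)
    try:
        # Keep the AsyncResult objects instead of discarding them.
        handles = [pool.apply_async(parse_positive, (text,)) for text in inputs]
        for text, handle in zip(inputs, handles):
            try:
                # get() re-raises in the parent whatever the worker raised.
                results.append(handle.get(timeout=30))
            except ValueError as exc:
                errors.append('{0!r}: {1}'.format(text, exc))
    finally:
        pool.close()
        pool.join()
    return results, errors


if __name__ == '__main__':
    ok, failed = collect(['3', 'x', '-1', '7'])
    print(ok)      # [3, 7]
    print(failed)  # one message per failed input
```

Without the get() calls, the two bad inputs would fail silently and no callback would ever fire for them.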
The reason the workers are probably failing (at least in the example code you provided) is that X and Y are defined as functions inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to send them to the worker processes. To fix this, define both functions at the top level of your module, rather than embedded inside the dirwalker function.
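You can check picklability directly, without involving a pool at all. This is only a sketch; top_level, make_nested and is_picklable are names made up for the example:

```python
import pickle


def top_level(x):
    """Defined at module level, so pickle can refer to it by name."""
    return x + 1


def make_nested():
    """Return a function defined inside another function."""
    def nested(x):
        return x + 1
    return nested


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Which exception type is raised depends on the Python version.
        return False


print(is_picklable(top_level))      # expected: True when run as a script
print(is_picklable(make_nested()))  # False
```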
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os

# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
    fileinfo = Z(f)
    return fileinfo

# Y() reads other types of files and does the same thing
def Y(f):
    fileinfo = Z(f)
    return fileinfo

# helper function
def Z(arr):
    return arr + "zzz"

def dirwalker(directory):
    ahlala = []

    # results() is the callback function
    def results(r):
        ahlala.append(r) # or .append, haven't yet decided

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if len(f) > 5: # Just an arbitrary thing to split up the list with
                # In Python 3, there's an error_callback you can use to handle
                # errors (error_callback=handle_error). It's not available in
                # Python 2.7 though :(
                pool.apply_async(X, args=(f,), callback=results)
            else:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala

if __name__ == "__main__":
    print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = mp.Manager()
helper_dict = m.dict()
for f in files:
    if len(f) > 5:
        pool.apply_async(X, args=(f, helper_dict), callback=results)
    else:
        pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this works by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a Proxy object. So every time you read from or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.
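A self-contained sketch of the Manager approach, assuming a trivial worker. record_length and measure are hypothetical names, and the worker simply stores the length of each string:

```python
import multiprocessing as mp


def record_length(name, shared):
    """Worker: store the length of name in the shared (proxied) dict."""
    shared[name] = len(name)  # each write is a round trip to the manager process


def measure(names):
    """Fill a manager-backed dict from pool workers and return a plain copy."""
    with mp.Manager() as manager:
        shared = manager.dict()
        pool = mp.Pool(2)
        try:
            for name in names:
                pool.apply_async(record_length, (name, shared))
        finally:
            pool.close()
            pool.join()
        # Copy into an ordinary dict before the manager process shuts down.
        return dict(shared)


if __name__ == '__main__':
    print(measure(['ftp', 'find', 'google-chrome']))
```

Copying the proxy into a plain dict at the end means later reads in the parent no longer pay the IPC cost.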

Sharing a result queue among several processes

The documentation for the multiprocessing module shows how to pass a queue to a process started with multiprocessing.Process. But how can I share a queue with asynchronous worker processes started with apply_async? I don't need dynamic joining or anything else, just a way for the workers to (repeatedly) report their results back to base.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    q = multiprocessing.Queue()
    workers = pool.apply_async(worker, (33, q))
This fails with:
RuntimeError: Queue objects should only be shared between processes through inheritance.
I understand what this means, and I understand the advice to inherit rather than require pickling/unpickling (and all the special Windows restrictions). But how do I pass the queue in a way that works? I can't find an example, and I've tried several alternatives that failed in various ways. Help please?
Try using multiprocessing.Manager to manage your queue and to also make it accessible to different workers.
import multiprocessing

def worker(name, que):
    que.put("%d is done" % name)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    m = multiprocessing.Manager()
    q = m.Queue()
    workers = pool.apply_async(worker, (33, q))
multiprocessing.Pool already has a shared result queue, so there is no need to additionally involve a Manager.Queue. Manager.Queue is a queue.Queue (a multithreading queue) under the hood, located in a separate server process and exposed via proxies. This adds overhead compared to Pool's internal queue. Also, unlike results returned through Pool's native result handling, the results in the Manager.Queue are not guaranteed to be ordered.
The worker processes are not started by .apply_async(); that already happens when you instantiate Pool. What is started when you call pool.apply_async() is a new "job". Pool's worker processes run the multiprocessing.pool.worker function under the hood. This function takes care of processing new "tasks" transferred over Pool's internal Pool._inqueue and of sending results back to the parent over Pool._outqueue. Your specified func is executed within multiprocessing.pool.worker. func only has to return something, and the result is automatically sent back to the parent.
.apply_async() immediately (asynchronously) returns an AsyncResult object (an alias for ApplyResult). You need to call .get() on that object, which blocks, to receive the actual result. Another option is to register a callback function, which fires as soon as the result becomes ready.
from multiprocessing import Pool

def busy_foo(i):
    """Dummy function simulating cpu-bound work."""
    for _ in range(int(10e6)): # do stuff
        pass
    return i

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool._outqueue) # DEMO
        results = [pool.apply_async(busy_foo, (i,)) for i in range(10)]
        # `.apply_async()` immediately returns AsyncResult (ApplyResult) object
        print(results[0]) # DEMO
        results = [res.get() for res in results]
        print(f'result: {results}')
Example Output:
<multiprocessing.queues.SimpleQueue object at 0x7fa124fd67f0>
<multiprocessing.pool.ApplyResult object at 0x7fa12586da20>
result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Note: Specifying the timeout-parameter for .get() will not stop the actual processing of the task within the worker, it only unblocks the waiting parent by raising a multiprocessing.TimeoutError.
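The timeout behaviour can be sketched like this. slow_square and get_with_timeout are invented names, and the delays are arbitrary values chosen so that the first get() is certain to time out:

```python
import multiprocessing
import time


def slow_square(x):
    """Worker that takes half a second to answer."""
    time.sleep(0.5)
    return x * x


def get_with_timeout(x):
    """Return (timed_out_first, final_result)."""
    pool = multiprocessing.Pool(1)
    try:
        handle = pool.apply_async(slow_square, (x,))
        try:
            handle.get(timeout=0.05)
            timed_out = False
        except multiprocessing.TimeoutError:
            # Only the waiting parent is unblocked; the worker keeps running.
            timed_out = True
        # A second get() without a timeout still receives the result.
        final = handle.get()
    finally:
        pool.close()
        pool.join()
    return timed_out, final


if __name__ == '__main__':
    print(get_with_timeout(3))  # (True, 9)
```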

Python and multiprocessing... how to call a function in the main process?

I would like to implement an async callback style function in python... This is what I came up with but I am not sure how to actually return to the main process and call the function.
import multiprocessing
import time

funcs = {}

def runCallback(uniqueId):
    '''
    I want this to be run in the main process.
    '''
    funcs[uniqueId]()

def someFunc(delay, uniqueId):
    '''
    This function runs in a separate process and just sleeps.
    '''
    time.sleep(delay)
    ### HERE I WANT TO CALL runCallback IN THE MAIN PROCESS ###
    # This does not work... It calls runCallback in the separate process:
    runCallback(uniqueId)

def setupCallback(func, delay):
    uniqueId = id(func)
    funcs[uniqueId] = func
    proc = multiprocessing.Process(target=someFunc, args=(delay, uniqueId))
    proc.start()
    return uniqueId
Here is how I want it to work:
def aFunc():
    return None

setupCallback(aFunc, 10)
### some code that gets run before aFunc is called ###
### aFunc runs 10s later ###
There is a gotcha here, because I want this to be a bit more complex. Basically when the code in the main process is done running... I want to examine the funcs dict and then run any of the callbacks that have not yet run. This means that runCallback also needs to remove entries from the funcs dict... the funcs dict is not shared with the separate processes, so I think runCallback needs to be called in the main process???
It is unclear why you are using the multiprocessing module here.
To call a function with delay in the same process you could use threading.Timer.
threading.Timer(10, aFunc).start()
Timer has a .cancel() method if you'd like to cancel the callback later:
t = threading.Timer(10, runCallback, args=[uniqueId, funcs])
t.start()
timers.append((t, uniqueId))
# do other stuff
# ...
# run callbacks right now
for t, uniqueId in timers:
    t.cancel() # after this the `runCallback()` won't be called by Timer()
               # if it's not been called already
    runCallback(uniqueId, funcs)
Where runCallback() is modified to remove functions to be called:
def runCallback(uniqueId, funcs):
    f = funcs.pop(uniqueId, None) # GIL protects this code with some caveats
    if f is not None:
        f()
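Putting the pieces together, a runnable sketch of the Timer approach might look like the following. The helper names setup_callback and flush, the calls list and the delays are assumptions made for the demonstration; they are not part of the question's code:

```python
import threading
import time


def run_callback(unique_id, funcs):
    """Run and remove the callback registered under unique_id, at most once."""
    f = funcs.pop(unique_id, None)
    if f is not None:
        f()


def setup_callback(func, delay, funcs, timers):
    """Register func and schedule it to run after delay seconds."""
    unique_id = id(func)
    funcs[unique_id] = func
    t = threading.Timer(delay, run_callback, args=[unique_id, funcs])
    t.start()
    timers.append((t, unique_id))
    return unique_id


def flush(funcs, timers):
    """Cancel pending timers and run any callbacks that have not run yet."""
    for t, unique_id in timers:
        t.cancel()
        run_callback(unique_id, funcs)


calls = []
funcs, timers = {}, []
setup_callback(lambda: calls.append('fast'), 0.05, funcs, timers)
setup_callback(lambda: calls.append('slow'), 60, funcs, timers)
time.sleep(0.3)       # 'fast' fires on its own timer thread during this pause
flush(funcs, timers)  # 'slow' is cancelled and run immediately instead
print(calls)          # ['fast', 'slow']
```

Because run_callback pops the entry before calling it, a callback that has already fired is simply skipped when flush runs.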
To do exactly what you're trying to do, you're going to need to set up a signal handler in the parent process to run the callback (or just remove the callback function that the child runs if it doesn't need access to any of the parent process's memory), and have the child process send a signal, but if your logic gets any more complex, you'll probably need to use another type of inter-process communication (IPC) such as pipes or sockets.
Another possibility is using threads instead of processes, then you can just run the callback from the second thread. You'll need to add a lock to synchronize access to the funcs dict.
