`multiprocessing` `starmap_async` only calls callback once?

`multiprocessing` `starmap_async` only calls callback once? - python

I have the following code which creates a pool for 4 workers, and calls a worker method. the code works fine for the most part. when running I see that different workers are being called to process the work. However calc_completed is never called once at the very end when all workers are complete. is this expected behaviour? I would have expected the callback to happen when each worker is completed.
def calculate_worker(x, y):
print 'working...'
...
def calc_completed(result):
print 'completed: %s'%str(result)
def calc_errored(result):
print 'error: %s'%str(result)
if __name__ == '__main__':
start, stop, step = 1, 1000, 1
ranges = [(n, min(n+step, stop)) for n in xrange(start, stop, step)]
pool = mp.Pool(processes=8)
res = pool.starmap_async(calculate_worker, ranges,
callback=calculate_worker, error_callback=calc_completed)
pool.close()
pool.join()
d = res.get()
print(d)

calc_completed is would only be called should there was any error encountered in the execution of the the mapped function (here: calculate_worker).
Another issue in your code is that you both running calculate_worker function in parallel and using it as a callback. This does not make much sense as calculate_worker will be called twice - first: as a worker function and secondly: as a function to report that the calculation have finished. You should have two different function there.
Given the functions in the snippet you provided I would change it the following way:
res = pool.starmap_async(calculate_worker, ranges,
callback=calc_completed,
error_callback=calc_errored)
If you want to test if calc_errored is called appropriately then you can introduce some random errors in the calculate_worker function to see if it is going to be handled, e.g.
def calculate_worker(x, y):
if (x % 7):
x / (y - y) # division by zero
print 'working...'

Related

Python core usage slower/under 100% with multiprocessing.Pool

Code that runs on one core # 100% actually runs slower when multiprocessed, where it runs on several cores # ~50%.
This question is asked frequently, and the best threads I've found about it (0, 1) give the answer, "It's because the workload isn't heavy enough, so the inter-process communication (IPC) overhead ends up making things slower."
I don't know whether or not this is right, but I've isolated an example where this happens AND doesn't happen for the same workload, and I want to know whether this answer still applies or why it actually happens:
from multiprocessing import Pool
def f(n):
res = 0
for i in range(n):
res += i**2
return res
def single(n):
""" Single core """
for i in range(n):
f(n)
def multi(n):
""" Multi core """
pool = Pool(2)
for i in range(n):
pool.apply_async(f, (n,))
pool.close()
pool.join()
def single_r(n):
""" Single core, returns """
res = 0
for i in range(n):
res = f(n) % 1000 # Prevent overflow
return res
def multi_r(n):
""" Multi core, returns """
pool = Pool(2)
res = 0
for i in range(n):
res = pool.apply_async(f, (n,)).get() % 1000
pool.close()
pool.join()
return res
# Run
n = 5000
if __name__ == "__main__":
print(f"single({n})...", end='')
single(n)
print(" DONE")
print(f"multi({n})...", end='')
multi(n)
print(" DONE")
print(f"single_r({n})...", end='')
single_r(n)
print(" DONE")
print(f"multi_r({n})...", end='')
multi_r(n)
print(" DONE")
The workload is f().
f() is run single-cored and dual-cored without return calls via single() and multi().
Then f() is run single-cored and dual-cored with return calls via single_r() and multi_r().
My result is that slowdown happens when f() is run multiprocessed with return calls. Without returns, it doesn't happen.
So single() takes q seconds. multi() is much faster. Good. Then single_r() takes q seconds. But then multi_r() takes much more than q seconds. Visual inspection of my system monitor corroborates this (a little hard to tell, but the multi(n) hump is shaded two colors, indicating activity from two different cores).
Also, corroborating video of the terminal outputs
Even with uniform workload, is this still IPC overhead? Is such overhead only paid when other processes return their results, and, if so, is there a way to avoid it while still returning results?

As Darkonaut pointed out, the slowdown when using multiple processes in multi_r() is because the get() call is blocking:
for i in range(n):
res = pool.apply_async(f, (n,)).get() % 1000
This effectively runs the workload sequentially or concurrently (more akin to multithreaded) while adding multiprocess overhead, making it run slower than the single-cored equivalent single_r()!
Meanwhile, multi() ran faster (i.e., ran in parallel correctly) because it contains no get() calls.
To run parallel and return results, collect result objects first as in:
def multi_r_collected(n):
""" Multi core, collects apply_async() results before returning them """
pool = Pool(2)
res = 0
res = [pool.apply_async(f, (n,)) for i in range(n)] # Collect first!
pool.close()
pool.join()
res = [r.get() % 1000 for r in res] # .get() after!
return res
Visual inspection of CPU activity corroborates the noticed speed-up; when run with 12 processes via Pool(12), there's a clean, uniform mesa of multiple cores clearly running at 100% in parallel (not the 50% mishmash of multi_r(n)).

Eliminating overhead in multiprocessing with pool

I am currently in a situation where I have parallelized code called repeatedly and try to reduce the overhead associated with the multiprocessing. So, consider the following example, which deliberately contains no "expensive" computations:
import multiprocessing as mp
def f(x):
# toy function
return x*x
if __name__ == '__main__':
for x in range(500):
pool = mp.Pool(processes=2)
print(pool.map(f, range(x, x + 50)))
pool.close()
pool.join() # necessary?
This code takes 53 seconds compared to 0.04 seconds for the sequential approach.
First question: do I really need to call pool.join() in this case when only pool.map() is ever used? I cannot find any negative effects from omitting it and the runtime would drop to 4.8 seconds. (I understand that omitting pool.close() is not possible, as we would be leaking threads then.)
Now, while this would be a nice improvement, as a first answer I would probably get "well, don't create the pool in the loop in the first place". Ok, no problem, but the parallelized code actually lives in an instance method, so I would use:
class MyObject:
def __init__(self):
self.pool = mp.Pool(processes=2)
def function(self, x):
print(self.pool.map(f, range(x, x + 50)))
if __name__ == '__main__':
my_object = MyObject()
for x in range(500):
my_object.function(x)
This would be my favorite solution as it runs in excellent 0.4 seconds.
Second question: should I call pool.close()/pool.join() somewhere explicitly (e.g. in the destructor of MyObject) or is the current code sufficient? (If it matters: it is ok to assume there are only a few long-lived instances of MyObject in my project.)

Of course it takes a long time: you keep allocating a new pool and destroying it for every x.
It will run much faster if instead you do:
if __name__ == '__main__':
pool = mp.Pool(processes=2) # allocate the pool only once
for x in range(500):
print(pool.map(f, range(x, x + 50)))
pool.close() # close it only after all the requests are submitted
pool.join() # wait for the last worker to finish
Try that and you'll see it now works much faster.
Here are links to the docs for join and close:
Once close is called you can't submit more tasks to the pool, and join waits till the last worker finished its job. They should be called in that order (first close then join).

Well, actually you could pass already allocated pool as argument to your object:
class MyObject:
def __init__(self, pool):
self.pool = pool
def function(self, x):
print(self.pool.map(f, range(x, x + 50)))
if __name__ == '__main__':
with mp.Pool(2) as pool:
my_object = MyObject(pool)
my_second_object = MyObject(pool)
for x in range(500):
my_object.function(x)
my_second_object.function(x)
pool.close()
I can not find a reason why it might be necessary to use different pools in different objects

Execute a list of process without multiprocessing pool map

import multiprocessing as mp
if __name__ == '__main__':
#pool = mp.Pool(M)
p1 = mp.Process(target= target1, args= (arg1,))
p2 = mp.Process(target= target2, args= (arg1,))
...
p9 = mp.Process(target= target9, args= (arg9,))
p10 = mp.Process(target= target10, args= (arg10,))
...
pN = mp.Process(target= targetN, args= (argN,))
processList = [p1, p2, .... , p9, p10, ... ,pN]
I have N different target functions which consume unequal non-trivial amount of time to execute.
I am looking for a way to execute them in parallel such that M (1 < M < N) processes are running simultaneously. And as soon as a process is finished next process should start from the list, until all the processes in processList are completed.
As I am not calling the same target function, I could not use Pool.
I considered doing something like this:
for i in range(0, N, M):
limit = i + M
if(limit > N):
limit = N
for p in processList[i:limit]:
p.join()
Since my target functions consume unequal time to execute, this method is not really efficient.
Any suggestions? Thanks in advance.
EDIT:
Question title has been changed to 'Execute a list of process without multiprocessing pool map' from 'Execute a list of process without multiprocessing pool'.

You can use proccess Pool:
#!/usr/bin/env python
# coding=utf-8
from multiprocessing import Pool
import random
import time
def target_1():
time.sleep(random.uniform(0.5, 2))
print('done target 1')
def target_2():
time.sleep(random.uniform(0.5, 2))
print('done target 1')
def target_3():
time.sleep(random.uniform(0.5, 2))
print('done target 1')
def target_4():
time.sleep(random.uniform(0.5, 2))
print('done target 1')
pool = Pool(2) # maximum two processes at time.
pool.apply_async(target_1)
pool.apply_async(target_2)
pool.apply_async(target_3)
pool.apply_async(target_4)
pool.close()
pool.join()
Pool is created specifically for what you need to do - execute many tasks in limited number of processes.
I also suggest you take a look at concurrent.futures library and it's backport to Python 2.7. It has a ProcessPoolExecutor, which has roughly same capabilities, but it's methods returns Future objects, and they has a nicer API.

Here is a way to do it in Python 3.4, which could be adapted for Python 2.7 :
targets_with_args = [
(target1, arg1),
(target2, arg2),
(target3, arg3),
...
]
with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
futures = [executor.submit(target, arg) for target, arg in targets_with_args]
results = [future.result() for future in concurrent.futures.as_completed(futures)]

I would use a Queue. adding processes to it from processList, and as soon as a process is finished i would remove it from the queue and add another one.
a pseudo code will look like:
from Queue import Queue
q = Queue(m)
# add first process to queue
i = 0
q.put(processList[i])
processList[i].start()
i+=1
while not q.empty():
p=q.get()
# check if process is finish. if not return it to the queue for later checking
if p.is_alive():
p.put(t)
# add another process if there is space and there are more processes to add
if not q.full() and i < len(processList):
q.put(processList[i])
processList[i].start()
i+=1

A simple solution would be to wrap the functions target{1,2,...N} into a single function forward_to_target that forwards to the appropriate target{1,2,...N} function according to the argument that is passed in. If you cannot infer the appropriate target function from the arguments you currently use, replace each argument with a tuple (argX, X), then in the forward_to_target function unpack the tuple and forward to the appropriate function indicated by the X.

You could have two lists of targets and arguments, zip the two together - and send them to a runner function (here it's run_target_on_args):
#!/usr/bin/env python
import multiprocessing as mp
# target functions
targets = [len, str, len, zip]
# arguments for each function
args = [["arg1"], ["arg2"], ["arg3"], [["arg5"], ["arg6"]]]
# applies target function on it's arguments
def run_target_on_args(target_args):
return target_args[0](*target_args[1])
pool = mp.Pool()
print pool.map(run_target_on_args, zip(targets, args))

Python Multiprocessing with a single function

I have a simulation that is currently running, but the ETA is about 40 hours -- I'm trying to speed it up with multi-processing.
It essentially iterates over 3 values of one variable (L), and over 99 values of of a second variable (a). Using these values, it essentially runs a complex simulation and returns 9 different standard deviations. Thus (even though I haven't coded it that way yet) it is essentially a function that takes two values as inputs (L,a) and returns 9 values.
Here is the essence of the code I have:
STD_1 = []
STD_2 = []
# etc.
for L in range(0,6,2):
for a in range(1,100):
### simulation code ###
STD_1.append(value_1)
STD_2.append(value_2)
# etc.
Here is what I can modify it to:
master_list = []
def simulate(a,L):
### simulation code ###
return (a,L,STD_1, STD_2 etc.)
for L in range(0,6,2):
for a in range(1,100):
master_list.append(simulate(a,L))
Since each of the simulations are independent, it seems like an ideal place to implement some sort of multi-threading/processing.
How exactly would I go about coding this?
EDIT: Also, will everything be returned to the master list in order, or could it possibly be out of order if multiple processes are working?
EDIT 2: This is my code -- but it doesn't run correctly. It asks if I want to kill the program right after I run it.
import multiprocessing
data = []
for L in range(0,6,2):
for a in range(1,100):
data.append((L,a))
print (data)
def simulation(arg):
# unpack the tuple
a = arg[1]
L = arg[0]
STD_1 = a**2
STD_2 = a**3
STD_3 = a**4
# simulation code #
return((STD_1,STD_2,STD_3))
print("1")
p = multiprocessing.Pool()
print ("2")
results = p.map(simulation, data)
EDIT 3: Also what are the limitations of multiprocessing. I've heard that it doesn't work on OS X. Is this correct?

Wrap the data for each iteration up into a tuple.
Make a list data of those tuples
Write a function f to process one tuple and return one result
Create p = multiprocessing.Pool() object.
Call results = p.map(f, data)
This will run as many instances of f as your machine has cores in separate processes.
Edit1: Example:
from multiprocessing import Pool
data = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]
def f(t):
name, a, b, c = t
return (name, a + b + c)
p = Pool()
results = p.map(f, data)
print results
Edit2:
Multiprocessing should work fine on UNIX-like platforms such as OSX. Only platforms that lack os.fork (mainly MS Windows) need special attention. But even there it still works. See the multiprocessing documentation.

Here is one way to run it in parallel threads:
import threading
L_a = []
for L in range(0,6,2):
for a in range(1,100):
L_a.append((L,a))
# Add the rest of your objects here
def RunParallelThreads():
# Create an index list
indexes = range(0,len(L_a))
# Create the output list
output = [None for i in indexes]
# Create all the parallel threads
threads = [threading.Thread(target=simulate,args=(output,i)) for i in indexes]
# Start all the parallel threads
for thread in threads: thread.start()
# Wait for all the parallel threads to complete
for thread in threads: thread.join()
# Return the output list
return output
def simulate(list,index):
(L,a) = L_a[index]
list[index] = (a,L) # Add the rest of your objects here
master_list = RunParallelThreads()

Use Pool().imap_unordered if ordering is not important. It will return results in a non-blocking fashion.

Parallel recursive function in Python

How do I parallelize a recursive function in Python?
My function looks like this:
def f(x, depth):
if x==0:
return ...
else :
return [x] + map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
# Heavy compute, pure function
When trying to parallelize it with multiprocessing.Pool.map, Windows opens an infinite number of processes and hangs.
What's a good (preferably simple) way to parallelize it (for a single multicore machine)?
Here is the code that hangs:
from multiprocessing import Pool
pool = pool(processes=4)
def f(x, depth):
if x==0:
return ...
else :
return [x] + pool.map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
# Heavy compute, pure function

OK, sorry for the problems with this.
I'm going to answer a slightly different question where f() returns the sum of the values in the list. That is because it's not clear to me from your example what the return type of f() would be, and using an integer makes the code simple to understand.
This is complex because there are two different things happening in parallel:
the calculation of the expensive function in the pool
the recursive expansion of f()
I am very careful to only use the pool to calculate the expensive function. In that way we don't get an "explosion" of processes, but because this is asynchronous we need to postpone a lot of work for the callback that the worker calls once the expensive function is done.
More than that, we need to use a countdown latch so that we know when all the separate sub-calls to f() are complete.
There may be a simpler way (I am pretty sure there is, but I need to do other things), but perhaps this gives you an idea of what is possible:
from multiprocessing import Pool, Value, RawArray, RLock
from time import sleep
class Latch:
'''A countdown latch that lets us wait for a job of "n" parts'''
def __init__(self, n):
self.__counter = Value('i', n)
self.__lock = RLock()
def decrement(self):
with self.__lock:
self.__counter.value -= 1
print('dec', self.read())
return self.read() == 0
def read(self):
with self.__lock:
return self.__counter.value
def join(self):
while self.read():
sleep(1)
def list_of_values(x):
'''An expensive function'''
print(x, ': thinking...')
sleep(1)
print(x, ': thought')
return list(range(x))
pool = Pool()
def async_f(x, on_complete=None):
'''Return the sum of the values in the expensive list'''
if x == 0:
on_complete(0) # no list, return 0
else:
n = x # need to know size of result beforehand
latch = Latch(n) # wait for n entires to be calculated
result = RawArray('i', n+1) # where we will assemble the map
def delayed_map(values):
'''This is the callback for the pool async process - it runs
in a separate thread within this process once the
expensive list has been calculated and orchestrates the
mapping of f over the result.'''
result[0] = x # first value in list is x
for (v, i) in enumerate(values):
def callback(fx, i=i):
'''This is the callback passed to f() and is called when
the function completes. If it is the last of all the
calls in the map then it calls on_complete() (ie another
instance of this function) for the calling f().'''
result[i+1] = fx
if latch.decrement(): # have completed list
# at this point result contains [x]+map(f, ...)
on_complete(sum(result)) # so return sum
async_f(v, callback)
# Ask worker to generate list then call delayed_map
pool.apply_async(list_of_values, [x], callback=delayed_map)
def run():
'''Tie into the same mechanism as above, for the final value.'''
result = Value('i')
latch = Latch(1)
def final_callback(value):
result.value = value
latch.decrement()
async_f(6, final_callback)
latch.join() # wait for everything to complete
return result.value
print(run())
PS: I am using Python 3.2 and the ugliness above is because we are delaying computation of the final results (going back up the tree) until later. It's possible something like generators or futures could simplify things.
Also, I suspect you need a cache to avoid needlessly recalculating the expensive function when called with the same argument as earlier.
See also yaniv's answer - which seems to be an alternative way to reverse the order of the evaluation by being explicit about depth.

After thinking about this, I found a simple, not complete, but good enough answer:
# A partially parallel solution. Just do the first level of recursion in parallel. It might be enough work to fill all cores.
import multiprocessing
def f_helper(data):
return f(x=data['x'],depth=data['depth'], recursion_depth=data['recursion_depth'])
def f(x, depth, recursion_depth):
if depth==0:
return ...
else :
if recursion_depth == 0:
pool = multiprocessing.Pool(processes=4)
result = [x] + pool.map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } _x in list_of_values(x)])
pool.close()
else:
result = [x] + map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } _x in list_of_values(x)])
return result
def list_of_values(x):
# Heavy compute, pure function

I store the main process id initially and transfer it to sub programs.
When I need to start a multiprocessing job, I check the number of children of the main process. If it is less than or equal to the half of my CPU count, then I run it as parallel. If it greater than the half of my CPU count, then I run it serial. In this way, it avoids bottlenecks and uses CPU cores effectively. You can tune the number of cores for your case. For example, you can set it to the exact number of CPU cores, but you should not exceed it.
def subProgramhWrapper(func, args):
func(*args)
parent = psutil.Process(main_process_id)
children = parent.children(recursive=True)
num_cores = int(multiprocessing.cpu_count()/2)
if num_cores >= len(children):
#parallel run
pool = MyPool(num_cores)
results = pool.starmap(subProgram, input_params)
pool.close()
pool.join()
else:
#serial run
for input_param in input_params:
subProgramhWrapper(subProgram, input_param)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

`multiprocessing` `starmap_async` only calls callback once? - python

Related

Python core usage slower/under 100% with multiprocessing.Pool

Eliminating overhead in multiprocessing with pool

Execute a list of process without multiprocessing pool map

Python Multiprocessing with a single function

Parallel recursive function in Python

Categories

Resources