Parallel recursive function in Python

How do I parallelize a recursive function in Python?
My function looks like this:
def f(x, depth):
    if x == 0:
        return ...
    else:
        return [x] + map(lambda x: f(x, depth-1), list_of_values(x))

def list_of_values(x):
    # Heavy compute, pure function
When trying to parallelize it with multiprocessing.Pool.map, Windows opens an infinite number of processes and hangs.
What's a good (preferably simple) way to parallelize it (for a single multicore machine)?
Here is the code that hangs:
from multiprocessing import Pool
pool = Pool(processes=4)

def f(x, depth):
    if x == 0:
        return ...
    else:
        return [x] + pool.map(lambda x: f(x, depth-1), list_of_values(x))

def list_of_values(x):
    # Heavy compute, pure function

OK, sorry for the problems with this.
I'm going to answer a slightly different question where f() returns the sum of the values in the list. That is because it's not clear to me from your example what the return type of f() would be, and using an integer makes the code simple to understand.
This is complex because there are two different things happening in parallel:
the calculation of the expensive function in the pool
the recursive expansion of f()
I am very careful to only use the pool to calculate the expensive function. In that way we don't get an "explosion" of processes, but because this is asynchronous we need to postpone a lot of work for the callback that the worker calls once the expensive function is done.
More than that, we need to use a countdown latch so that we know when all the separate sub-calls to f() are complete.
There may be a simpler way (I am pretty sure there is, but I need to do other things), but perhaps this gives you an idea of what is possible:
from multiprocessing import Pool, Value, RawArray, RLock
from time import sleep

class Latch:

    '''A countdown latch that lets us wait for a job of "n" parts'''

    def __init__(self, n):
        self.__counter = Value('i', n)
        self.__lock = RLock()

    def decrement(self):
        with self.__lock:
            self.__counter.value -= 1
            print('dec', self.read())
            return self.read() == 0

    def read(self):
        with self.__lock:
            return self.__counter.value

    def join(self):
        while self.read():
            sleep(1)


def list_of_values(x):
    '''An expensive function'''
    print(x, ': thinking...')
    sleep(1)
    print(x, ': thought')
    return list(range(x))


pool = Pool()


def async_f(x, on_complete=None):
    '''Return the sum of the values in the expensive list'''
    if x == 0:
        on_complete(0)  # no list, return 0
    else:
        n = x  # need to know size of result beforehand
        latch = Latch(n)  # wait for n entries to be calculated
        result = RawArray('i', n+1)  # where we will assemble the map

        def delayed_map(values):
            '''This is the callback for the pool async process - it runs
               in a separate thread within this process once the
               expensive list has been calculated and orchestrates the
               mapping of f over the result.'''
            result[0] = x  # first value in list is x
            for i, v in enumerate(values):  # enumerate() yields (index, value)

                def callback(fx, i=i):
                    '''This is the callback passed to f() and is called when
                       the function completes. If it is the last of all the
                       calls in the map then it calls on_complete() (ie another
                       instance of this function) for the calling f().'''
                    result[i+1] = fx
                    if latch.decrement():  # have completed list
                        # at this point result contains [x]+map(f, ...)
                        on_complete(sum(result))  # so return sum

                async_f(v, callback)

        # Ask worker to generate list then call delayed_map
        pool.apply_async(list_of_values, [x], callback=delayed_map)


def run():
    '''Tie into the same mechanism as above, for the final value.'''
    result = Value('i')
    latch = Latch(1)

    def final_callback(value):
        result.value = value
        latch.decrement()

    async_f(6, final_callback)
    latch.join()  # wait for everything to complete
    return result.value


print(run())
PS: I am using Python 3.2 and the ugliness above is because we are delaying computation of the final results (going back up the tree) until later. It's possible something like generators or futures could simplify things.
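For instance, here is a minimal sketch (my own, not part of the answer above) of how concurrent.futures could express the same sum variant: the recursion is flattened into a level-by-level expansion of the call tree, and only the expensive list_of_values() calls run in worker processes.
from concurrent.futures import ProcessPoolExecutor

def list_of_values(x):
    """Stand-in for the expensive, pure function."""
    return list(range(x))

def f_sum(x):
    """Sum variant of f(): x plus the sum over its recursive expansion."""
    total = 0
    frontier = [x]                         # walk the call tree level by level
    with ProcessPoolExecutor() as executor:
        while frontier:
            total += sum(frontier)
            nonzero = [v for v in frontier if v != 0]
            # only the expensive function runs in the worker processes;
            # each level of the tree is expanded in one parallel batch
            expansions = executor.map(list_of_values, nonzero)
            frontier = [v for values in expansions for v in values]
    return total

if __name__ == '__main__':
    print(f_sum(6))  # should print 63 with the stand-in list_of_values above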
Also, I suspect you need a cache to avoid needlessly recalculating the expensive function when called with the same argument as earlier.
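If list_of_values() really is pure, one minimal caching sketch (my suggestion, not code from the question) is to memoize it with functools.lru_cache. Note that with multiprocessing each worker process keeps its own cache, so entries are not shared across workers.
from functools import lru_cache

@lru_cache(maxsize=None)
def list_of_values(x):
    # stand-in for the heavy, pure computation; repeated calls with the
    # same x in the same process return the cached result
    return list(range(x))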
See also yaniv's answer - which seems to be an alternative way to reverse the order of the evaluation by being explicit about depth.

After thinking about this, I found a simple, not complete, but good enough answer:
# A partially parallel solution: just do the first level of the recursion in
# parallel. It might be enough work to fill all cores.
import multiprocessing

def f_helper(data):
    return f(x=data['x'], depth=data['depth'], recursion_depth=data['recursion_depth'])

def f(x, depth, recursion_depth):
    if depth == 0:
        return ...
    else:
        params = [{'x': _x, 'depth': depth - 1, 'recursion_depth': recursion_depth + 1}
                  for _x in list_of_values(x)]
        if recursion_depth == 0:
            pool = multiprocessing.Pool(processes=4)
            result = [x] + pool.map(f_helper, params)
            pool.close()
        else:
            result = [x] + list(map(f_helper, params))
        return result

def list_of_values(x):
    # Heavy compute, pure function

I store the main process id initially and transfer it to the sub-programs.
When I need to start a multiprocessing job, I check the number of children of the main process. If it is less than or equal to half of my CPU count, then I run it in parallel; if it is greater than half of my CPU count, then I run it serially. This avoids bottlenecks and uses the CPU cores effectively. You can tune the number of cores for your case; for example, you can set it to the exact number of CPU cores, but you should not exceed it.
import multiprocessing

import psutil

# MyPool, subProgram, input_params and main_process_id are defined elsewhere
# in the author's program.

def subProgramWrapper(func, args):
    func(*args)

parent = psutil.Process(main_process_id)
children = parent.children(recursive=True)
num_cores = int(multiprocessing.cpu_count() / 2)

if num_cores >= len(children):
    # parallel run
    pool = MyPool(num_cores)
    results = pool.starmap(subProgram, input_params)
    pool.close()
    pool.join()
else:
    # serial run
    for input_param in input_params:
        subProgramWrapper(subProgram, input_param)
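The answer does not show where main_process_id comes from; a hypothetical sketch of the "store the main process id and pass it down" step (names are illustrative, not from the original post) could look like this:
import os

import psutil

def run_sub_program(input_params, main_process_id):
    """Decide parallel vs. serial based on the main process's children."""
    children = psutil.Process(main_process_id).children(recursive=True)
    print(f"main pid {main_process_id} currently has {len(children)} children")

def main():
    main_process_id = os.getpid()  # captured once, in the top-level process
    run_sub_program([("a", 1), ("b", 2)], main_process_id)

if __name__ == '__main__':
    main()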

Related

Given N generators, is it possible to create a generator that runs them in parallel processes and yields the zip of those generators?

Suppose I have N generators gen_1, ..., gen_N where each of them will yield the same number of values. I would like a generator gen such that it runs gen_1, ..., gen_N in N parallel processes and yields (next(gen_1), next(gen_2), ... next(gen_N)).
That is, I would like to have:
def gen():
    yield (next(gen_1), next(gen_2), ... next(gen_N))
in such a way that each gen_i runs in its own process. Is it possible to do this? I have tried doing it in the following dummy example, with no success:
from multiprocessing import Process

A = range(4)

def gen(a):
    B = ['a', 'b', 'c']
    for b in B:
        yield b + str(a)

def target(g):
    return next(g)

processes = [Process(target=target, args=(gen(a),)) for a in A]

for p in processes:
    p.start()

for p in processes:
    p.join()
However I get the error TypeError: cannot pickle 'generator' object.
EDIT:
I have modified @darkonaut's answer a bit to fit my needs. I am posting it in case some of you find it useful. We first define a couple of utility functions:
from itertools import zip_longest
from typing import List, Generator
def grouper(iterable, n, fillvalue=iter([])):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def split_generators_into_batches(generators: List[Generator], n_splits):
    chunks = grouper(generators, len(generators) // n_splits + 1)
    return [zip_longest(*chunk) for chunk in chunks]
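For a sense of what these helpers produce, here is a small usage illustration (not from the original post), assuming the two functions above are defined:
# five toy generators split into two batches; each batch yields tuples with
# one element per generator, padded with None where the last batch is shorter
gens = [iter(range(i, i + 2)) for i in range(5)]
for batch in split_generators_into_batches(gens, n_splits=2):
    print(list(batch))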
The following class is responsible for splitting any number of generators into n (number of processes) batches and processing them, yielding the desired result:
import itertools
import multiprocessing as mp

class GeneratorParallelProcessor:
    SENTINEL = 'S'

    def __init__(self, generators, n_processes=2 * mp.cpu_count()):
        self.n_processes = n_processes
        self.generators = split_generators_into_batches(list(generators), n_processes)
        self.queue = mp.SimpleQueue()
        self.barrier = mp.Barrier(n_processes + 1)
        self.sentinels = [self.SENTINEL] * n_processes
        self.processes = [
            mp.Process(target=self._worker, args=(self.barrier, self.queue, gen)) for gen in self.generators
        ]

    def process(self):
        for p in self.processes:
            p.start()
        while True:
            results = list(itertools.chain(*(self.queue.get() for _ in self.generators)))
            if results != self.sentinels:
                yield results
                self.barrier.wait()
            else:
                break
        for p in self.processes:
            p.join()

    def _worker(self, barrier, queue, generator):
        for x in generator:
            queue.put(x)
            barrier.wait()
        queue.put(self.SENTINEL)
To use it just do the following:
parallel_processor = GeneratorParallelProcessor(generators)
for grouped_generator in parallel_processor.process():
    output_handler(grouped_generator)
It's possible to get such a "Unified Parallel Generator (UPG)" (attempt to coin a name) with some effort, but as @jasonharper already mentioned, you definitely need to assemble the sub-generators within the child processes, since a running generator can't be pickled.
The pattern below is re-usable with only the generator function gen() being custom to this example. The design uses multiprocessing.SimpleQueue for returning generator results to the parent and multiprocessing.Barrier for synchronization.
Calling Barrier.wait() will block the caller (a thread in any process) until the specified number of parties has called .wait(), whereupon all threads currently waiting on the Barrier are released simultaneously. The usage of Barrier here ensures that further generator results only start to be computed after the parent has received all results from an iteration, which might be desirable to keep overall memory consumption in check.
The number of parallel workers used equals the number of argument-tuples you provide within the gen_args_tuples-iterable, so gen_args_tuples=zip(range(4)) will use four workers for example. See comments in code for further details.
import multiprocessing as mp

SENTINEL = 'SENTINEL'


def gen(a):
    """Your individual generator function."""
    lst = ['a', 'b', 'c']
    for ch in lst:
        for _ in range(int(10e6)):  # some dummy computation
            pass
        yield ch + str(a)


def _worker(i, barrier, queue, gen_func, gen_args):
    for x in gen_func(*gen_args):
        print(f"WORKER-{i} sending item.")
        queue.put((i, x))
        barrier.wait()
    queue.put(SENTINEL)


def parallel_gen(gen_func, gen_args_tuples):
    """Construct and yield from parallel generators
    built from `gen_func(gen_args)`.
    """
    gen_args_tuples = list(gen_args_tuples)  # ensure list
    n_gens = len(gen_args_tuples)
    sentinels = [SENTINEL] * n_gens
    queue = mp.SimpleQueue()
    barrier = mp.Barrier(n_gens + 1)  # `parties`: + 1 for parent

    processes = [
        mp.Process(target=_worker, args=(i, barrier, queue, gen_func, args))
        for i, args in enumerate(gen_args_tuples)
    ]
    for p in processes:
        p.start()

    while True:
        results = [queue.get() for _ in range(n_gens)]
        if results != sentinels:
            results.sort()
            yield tuple(r[1] for r in results)  # sort and drop ids
            barrier.wait()  # all workers are waiting already,
                            # so this will unblock immediately
        else:
            break

    for p in processes:
        p.join()


if __name__ == '__main__':
    for res in parallel_gen(gen_func=gen, gen_args_tuples=zip(range(4))):
        print(res)
Output:
WORKER-1 sending item.
WORKER-0 sending item.
WORKER-3 sending item.
WORKER-2 sending item.
('a0', 'a1', 'a2', 'a3')
WORKER-1 sending item.
WORKER-2 sending item.
WORKER-3 sending item.
WORKER-0 sending item.
('b0', 'b1', 'b2', 'b3')
WORKER-2 sending item.
WORKER-3 sending item.
WORKER-1 sending item.
WORKER-0 sending item.
('c0', 'c1', 'c2', 'c3')
Process finished with exit code 0
I went for a slightly different approach; you can modify the example below accordingly.
So somewhere in the main script, initialize the pool according to your needs; you need just these two lines:
from multiprocessing import Pool
pool = Pool(processes=4)
then you can define a generator function like this:
(Note that the generators input is assumed to be any iterable containing all the generators)
def parallel_generators(generators, pool):
    results = ['placeholder']
    while len(results) != 0:
        batch = pool.map_async(next, generators)  # defines the next round of values
        results = list(batch.get())  # actual calculation done here
        yield results
    return
We define the loop condition on results like this because mapping next over generators that have stopped producing values is expected to return an empty list, and at that point we just terminate the parallel generator.
EDIT
So apparently multiprocessing Pool and map don't play well with generators, making the above code not work as intended, so do not use it until a later update.
As for the pickle error, it seems some bound functions do not support pickle, which is needed by the multiprocessing library in order to transfer objects and functions between processes. As a workaround, the pathos multiprocessing library uses dill, which removes the need for pickle and is an option you might want to try; searching Stack Overflow for your error, you can also find some more complicated solutions with custom code for pickling the functions needed.
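For what it's worth, a minimal sketch of the pathos route (assuming the third-party pathos package is installed) might look like this; its pools use dill for serialization, so even a lambda can be mapped:
from pathos.multiprocessing import ProcessingPool as Pool

if __name__ == '__main__':
    pool = Pool(4)
    # dill-based serialization lets pathos ship the lambda to the workers
    print(pool.map(lambda x: x * x, range(10)))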

Python core usage slower/under 100% with multiprocessing.Pool

Code that runs on one core at 100% actually runs slower when multiprocessed, where it runs on several cores at ~50%.
This question is asked frequently, and the best threads I've found about it (0, 1) give the answer, "It's because the workload isn't heavy enough, so the inter-process communication (IPC) overhead ends up making things slower."
I don't know whether or not this is right, but I've isolated an example where this happens AND doesn't happen for the same workload, and I want to know whether this answer still applies or why it actually happens:
from multiprocessing import Pool

def f(n):
    res = 0
    for i in range(n):
        res += i**2
    return res

def single(n):
    """ Single core """
    for i in range(n):
        f(n)

def multi(n):
    """ Multi core """
    pool = Pool(2)
    for i in range(n):
        pool.apply_async(f, (n,))
    pool.close()
    pool.join()

def single_r(n):
    """ Single core, returns """
    res = 0
    for i in range(n):
        res = f(n) % 1000  # Prevent overflow
    return res

def multi_r(n):
    """ Multi core, returns """
    pool = Pool(2)
    res = 0
    for i in range(n):
        res = pool.apply_async(f, (n,)).get() % 1000
    pool.close()
    pool.join()
    return res

# Run
n = 5000

if __name__ == "__main__":
    print(f"single({n})...", end='')
    single(n)
    print(" DONE")

    print(f"multi({n})...", end='')
    multi(n)
    print(" DONE")

    print(f"single_r({n})...", end='')
    single_r(n)
    print(" DONE")

    print(f"multi_r({n})...", end='')
    multi_r(n)
    print(" DONE")
The workload is f().
f() is run single-cored and dual-cored without return calls via single() and multi().
Then f() is run single-cored and dual-cored with return calls via single_r() and multi_r().
My result is that slowdown happens when f() is run multiprocessed with return calls. Without returns, it doesn't happen.
So single() takes q seconds. multi() is much faster. Good. Then single_r() takes q seconds. But then multi_r() takes much more than q seconds. Visual inspection of my system monitor corroborates this (a little hard to tell, but the multi(n) hump is shaded two colors, indicating activity from two different cores).
There is also a corroborating video of the terminal outputs.
Even with uniform workload, is this still IPC overhead? Is such overhead only paid when other processes return their results, and, if so, is there a way to avoid it while still returning results?
As Darkonaut pointed out, the slowdown when using multiple processes in multi_r() is because the get() call is blocking:
for i in range(n):
    res = pool.apply_async(f, (n,)).get() % 1000
This effectively runs the workload sequentially or concurrently (more akin to multithreaded) while adding multiprocess overhead, making it run slower than the single-cored equivalent single_r()!
Meanwhile, multi() ran faster (i.e., ran in parallel correctly) because it contains no get() calls.
To run parallel and return results, collect result objects first as in:
def multi_r_collected(n):
    """ Multi core, collects apply_async() results before returning them """
    pool = Pool(2)
    res = 0
    res = [pool.apply_async(f, (n,)) for i in range(n)]  # Collect first!
    pool.close()
    pool.join()
    res = [r.get() % 1000 for r in res]  # .get() after!
    return res
Visual inspection of CPU activity corroborates the noticed speed-up; when run with 12 processes via Pool(12), there's a clean, uniform mesa of multiple cores clearly running at 100% in parallel (not the 50% mishmash of multi_r(n)).
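If you want numbers rather than a system-monitor screenshot, a quick timing harness (not part of the original answer, assuming single_r(), multi_r() and multi_r_collected() are defined as above in the same module) could look like this:
import time

if __name__ == "__main__":
    n = 5000  # use a smaller n for a quicker comparison
    for fn in (single_r, multi_r, multi_r_collected):
        start = time.perf_counter()
        fn(n)
        print(f"{fn.__name__}({n}) took {time.perf_counter() - start:.2f} s")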

`multiprocessing` `starmap_async` only calls callback once?

I have the following code which creates a pool of workers and calls a worker method. The code works fine for the most part: when running, I see that different workers are being called to process the work. However, calc_completed is never called at the very end when all the workers are complete. Is this expected behaviour? I would have expected the callback to happen when each worker completes.
def calculate_worker(x, y):
    print 'working...'
    ...

def calc_completed(result):
    print 'completed: %s'%str(result)

def calc_errored(result):
    print 'error: %s'%str(result)

if __name__ == '__main__':
    start, stop, step = 1, 1000, 1
    ranges = [(n, min(n+step, stop)) for n in xrange(start, stop, step)]

    pool = mp.Pool(processes=8)
    res = pool.starmap_async(calculate_worker, ranges,
                             callback=calculate_worker, error_callback=calc_completed)
    pool.close()
    pool.join()

    d = res.get()
    print(d)
calc_completed would only be called if an error was encountered during the execution of the mapped function (here: calculate_worker), because your code passes it as the error_callback.
Another issue in your code is that you are both running the calculate_worker function in parallel and using it as the callback. This does not make much sense, as calculate_worker will be called twice: first as the worker function, and second as the function that reports that the calculation has finished. You should have two different functions there.
Given the functions in the snippet you provided I would change it the following way:
res = pool.starmap_async(calculate_worker, ranges,
                         callback=calc_completed,
                         error_callback=calc_errored)
If you want to test whether calc_errored is called appropriately, you can introduce some errors in the calculate_worker function and see if they are handled, e.g.
def calculate_worker(x, y):
    if (x % 7):
        x / (y - y)  # division by zero
    print 'working...'

Concurrent futures wait for subset of tasks

I'm using Python's concurrent.futures framework. I have used the map() function to launch concurrent tasks as such:
import concurrent.futures

def func(i):
    return i*i

list = [1, 2, 3, 4, 5]
async_executor = concurrent.futures.ThreadPoolExecutor(5)
results = async_executor.map(func, list)
I am interested only in the first n results and want to stop the executor after the first n threads are finished where n is a number less than the size of the input list. Is there any way to do this in Python? Is there another framework I should look into?
You can't use map() for this because it provides no way to stop waiting for the results, nor any way to get the submitted futures and cancel them. However, you can do it using submit():
import concurrent.futures
import time

def func(i):
    time.sleep(i)
    return i*i

list = [1, 2, 3, 6, 6, 6, 90, 100]
async_executor = concurrent.futures.ThreadPoolExecutor(2)
futures = {async_executor.submit(func, i): i for i in list}

for ii, future in enumerate(concurrent.futures.as_completed(futures)):
    print(ii, "result is", future.result())
    if ii == 2:
        async_executor.shutdown(wait=False)
        for victim in futures:
            victim.cancel()
        break
The above code takes about 11 seconds to run: it executes only the first few jobs but not the rest.
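On Python 3.9 and later there is a more direct variant (not part of the original answer): Executor.shutdown() accepts cancel_futures=True, which cancels every future that has not started running, so the manual cancel loop is unnecessary.
import concurrent.futures
import time

def func(i):
    time.sleep(i)
    return i * i

if __name__ == '__main__':
    values = [1, 2, 3, 6, 6, 6, 90, 100]
    async_executor = concurrent.futures.ThreadPoolExecutor(2)
    futures = [async_executor.submit(func, i) for i in values]
    for ii, future in enumerate(concurrent.futures.as_completed(futures)):
        print(ii, "result is", future.result())
        if ii == 2:
            # cancels all futures that have not started running yet
            async_executor.shutdown(wait=False, cancel_futures=True)
            break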

Eliminating overhead in multiprocessing with pool

I am currently in a situation where I have parallelized code called repeatedly and try to reduce the overhead associated with the multiprocessing. So, consider the following example, which deliberately contains no "expensive" computations:
import multiprocessing as mp

def f(x):
    # toy function
    return x*x

if __name__ == '__main__':
    for x in range(500):
        pool = mp.Pool(processes=2)
        print(pool.map(f, range(x, x + 50)))
        pool.close()
        pool.join()  # necessary?
This code takes 53 seconds compared to 0.04 seconds for the sequential approach.
First question: do I really need to call pool.join() in this case when only pool.map() is ever used? I cannot find any negative effects from omitting it and the runtime would drop to 4.8 seconds. (I understand that omitting pool.close() is not possible, as we would be leaking threads then.)
Now, while this would be a nice improvement, as a first answer I would probably get "well, don't create the pool in the loop in the first place". Ok, no problem, but the parallelized code actually lives in an instance method, so I would use:
class MyObject:
    def __init__(self):
        self.pool = mp.Pool(processes=2)

    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))

if __name__ == '__main__':
    my_object = MyObject()
    for x in range(500):
        my_object.function(x)
This would be my favorite solution as it runs in excellent 0.4 seconds.
Second question: should I call pool.close()/pool.join() somewhere explicitly (e.g. in the destructor of MyObject) or is the current code sufficient? (If it matters: it is ok to assume there are only a few long-lived instances of MyObject in my project.)
Of course it takes a long time: you keep allocating a new pool and destroying it for every x.
It will run much faster if instead you do:
if __name__ == '__main__':
    pool = mp.Pool(processes=2)  # allocate the pool only once
    for x in range(500):
        print(pool.map(f, range(x, x + 50)))
    pool.close()  # close it only after all the requests are submitted
    pool.join()   # wait for the last worker to finish
Try that and you'll see it now works much faster.
Here are links to the docs for join and close:
Once close is called you can't submit more tasks to the pool, and join waits until the last worker has finished its job. They should be called in that order (first close, then join).
Well, actually you could pass an already allocated pool as an argument to your object:
class MyObject:
    def __init__(self, pool):
        self.pool = pool

    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))

if __name__ == '__main__':
    with mp.Pool(2) as pool:
        my_object = MyObject(pool)
        my_second_object = MyObject(pool)
        for x in range(500):
            my_object.function(x)
            my_second_object.function(x)
        pool.close()
I cannot find a reason why it would be necessary to use different pools in different objects.
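One caveat worth adding to the snippet above (not from the original answers): when multiprocessing.Pool is used as a context manager, __exit__() calls terminate() rather than close()/join(), so make sure all results are collected before the with block ends; the explicit pool.close() inside the block is then optional. A minimal sketch:
import multiprocessing as mp

def f(x):
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        results = [pool.map(f, range(x, x + 50)) for x in range(500)]
    # leaving the block calls terminate(); the results were already collected,
    # because pool.map() blocks until each batch is done
    print(len(results))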
