I have a class that stores a large NumPy array in its state, and this makes multiprocessing.Pool extremely slow. Here's an MRE:
from multiprocessing import Pool
import numpy
import time
from tqdm import tqdm
class MP(object):
    def __init__(self, mat):
        self.mat = mat
    def foo(self, x):
        time.sleep(1)
        return x*x + self.mat.shape[0]
    def bar(self, arr):
        results = []
        with Pool() as p:
            for x in tqdm(p.imap(self.foo, arr)):
                results.append(x)
        return results
if __name__ == '__main__':
    x = numpy.arange(8)
    mat = numpy.random.random((1,1))
    h = MP(mat)
    res = h.bar(x)
    print(res)
I've got 4 cores on CPU, which means that this code should (and does) run in approximately 2 seconds. (The tqdm shows the 2 seconds as a progress bar, it's not really necessary to this example). However, in the main program, if I do mat = numpy.random.random((10000,10000)), it takes forever to run. I suspect this is because Pool is making copies of mat for each worker, but I'm not sure how this works because mat is in the state of the Class, and not directly involved in the imap call. So, my questions are:
Why is this behavior happening? (i.e., how does Pool work within a Class? What exactly does it pickle? What copies are made, and what is passed by reference?)
What is a viable workaround to this problem?
Edit: Modified foo to make use of mat, which is more representative of my real problem.
If, as you say, mat is not directly involved in the imap call, I'm guessing that in general the state of MP is not used in the imap call (if it is, comment below and I'll remove this answer). If that's the case, you should write foo as a plain module-level function instead of as a method of MP. The reason mat is getting copied right now is that each call to foo needs self, so the bound method self.foo is pickled together with self (and therefore self.mat) for every task sent to a worker.
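To see why the array travels with the method, you can measure how large the pickled bound method is. This is only a minimal sketch, not code from the original post: `task_payload_size` is a hypothetical helper, and it assumes the pool serializes the callable with pickle, as the standard library does.

```python
import pickle

import numpy as np


class MP:
    def __init__(self, mat):
        self.mat = mat

    def foo(self, x):
        return x * x + self.mat.shape[0]


def task_payload_size(obj):
    """Hypothetical helper: bytes needed to pickle the bound method obj.foo."""
    return len(pickle.dumps(obj.foo))


small = task_payload_size(MP(np.zeros((1, 1))))
large = task_payload_size(MP(np.zeros((1000, 1000))))
# The difference is dominated by the 1000 * 1000 float64 values in mat.
print(small, large)
```

Pickling the bound method pickles the instance it is bound to, which is why the payload grows with mat even though mat never appears in the imap call.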
The following executes quickly regardless of the size of mat:
from multiprocessing import Pool
import numpy
import time
from tqdm import tqdm
class MP(object):
    def __init__(self, mat):
        self.mat = mat
    def bar(self, arr):
        results = []
        with Pool() as p:
            for x in tqdm(p.imap(foo, arr)):
                results.append(x)
        return results
def foo(x):
    time.sleep(1)
    return x * x
if __name__ == '__main__':
    x = numpy.arange(8)
    mat = numpy.random.random((10000, 10000))
    h = MP(mat)
    res = h.bar(x)
    print(res)
If foo actually does need to be passed MP because it actually does need to read from mat, then there is no way to avoid sending mat to each worker process, and your question 2 does not have an answer other than "you can't". But hopefully I've answered your question 1.
Related
I want to run a loop in parallel using Pool and store each value returned by a function at an index of a NumPy array. I have written a basic function here; the real one is a bit more complex. Even with this basic one I am not getting the desired output: printing results at the end gives 100 different arrays of 100 values instead of one array of 100 values. How do I solve this, or is there a better way to store the return values? I need to take the mean and std of rejects after the pool finishes.
from multiprocessing import Pool
import numpy as np
rejects = np.zeros(100)
def func(i):
    print("this is:",i)
    rejects[i]=i
    # print (rejects)
    return rejects
def main():
    l = [*range(1,100, 1)]
    pool = Pool(3)
    results=pool.map(func, l)
    pool.close()
    pool.join()
    print (results)
if __name__ == '__main__':
    main()
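This happens because func returns the entire rejects array on every call, so pool.map collects one full array per input. Each worker process also writes only to its own copy of rejects, so the array in the parent is never updated. Return just the per-item value instead, using the func below:
def func(i):
    print("this is:",i)
    rejects=i # this is where I have changed
    # print (rejects)
    return rejects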
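Since pool.map returns the results in input order, the list can then be turned into a single array for the mean and std. A minimal sketch, assuming func returns one scalar per input; `summarize` is a hypothetical helper and the body of func is a stand-in for the real computation.

```python
from multiprocessing import Pool

import numpy as np


def func(i):
    # Stand-in for the real computation; returns one value per input.
    return i


def summarize(values):
    """Hypothetical helper: pack per-item results into one array."""
    rejects = np.asarray(values, dtype=float)
    return rejects, rejects.mean(), rejects.std()


def main():
    with Pool(3) as pool:
        results = pool.map(func, range(100))  # ordered list of 100 scalars
    return summarize(results)


if __name__ == "__main__":
    rejects, mean, std = main()
    print(rejects.shape, mean, std)
```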
I would like to use a numpy array in shared memory for use with the multiprocessing module. The difficulty is using it like a numpy array, and not just as a ctypes array.
from multiprocessing import Process, Array
import scipy
def f(a):
    a[0] = -a[0]
if __name__ == '__main__':
    # Create the array
    N = int(10)
    unshared_arr = scipy.rand(N)
    arr = Array('d', unshared_arr)
    print "Originally, the first two elements of arr = %s"%(arr[:2])
    # Create, start, and finish the child processes
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()
    # Printing out the changed values
    print "Now, the first two elements of arr = %s"%arr[:2]
This produces output such as:
Originally, the first two elements of arr = [0.3518653236697369, 0.517794725524976]
Now, the first two elements of arr = [-0.3518653236697369, 0.517794725524976]
The array can be accessed in a ctypes manner, e.g. arr[i] makes sense. However, it is not a numpy array, and I cannot perform operations such as -1*arr, or arr.sum(). I suppose a solution would be to convert the ctypes array into a numpy array. However (besides not being able to make this work), I don't believe it would be shared anymore.
It seems there would be a standard solution to what has to be a common problem.
To add to @unutbu's (no longer available) and @Henry Gomersall's answers: you could use shared_arr.get_lock() to synchronize access when needed:
shared_arr = mp.Array(ctypes.c_double, N)
# ...
def f(i): # could be anything numpy accepts as an index, such as another numpy array
    with shared_arr.get_lock(): # synchronize access
        arr = np.frombuffer(shared_arr.get_obj()) # no data copying
        arr[i] = -arr[i]
Example
import ctypes
import logging
import multiprocessing as mp
from contextlib import closing
import numpy as np
info = mp.get_logger().info
def main():
    logger = mp.log_to_stderr()
    logger.setLevel(logging.INFO)
    # create shared array
    N, M = 100, 11
    shared_arr = mp.Array(ctypes.c_double, N)
    arr = tonumpyarray(shared_arr)
    # fill with random values
    arr[:] = np.random.uniform(size=N)
    arr_orig = arr.copy()
    # write to arr from different processes
    with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
        # many processes access the same slice
        stop_f = N // 10
        p.map_async(f, [slice(stop_f)]*M)
        # many processes access different slices of the same array
        assert M % 2 # odd
        step = N // 10
        p.map_async(g, [slice(i, i + step) for i in range(stop_f, N, step)])
    p.join()
    assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)
def init(shared_arr_):
    global shared_arr
    shared_arr = shared_arr_ # must be inherited, not passed as an argument
def tonumpyarray(mp_arr):
    return np.frombuffer(mp_arr.get_obj())
def f(i):
    """synchronized."""
    with shared_arr.get_lock(): # synchronize access
        g(i)
def g(i):
    """no synchronization."""
    info("start %s" % (i,))
    arr = tonumpyarray(shared_arr)
    arr[i] = -1 * arr[i]
    info("end %s" % (i,))
if __name__ == '__main__':
    mp.freeze_support()
    main()
If you don't need synchronized access or you create your own locks then mp.Array() is unnecessary. You could use mp.sharedctypes.RawArray in this case.
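For the lock-free case, a minimal sketch of wrapping a RawArray in a NumPy view could look like this. It uses `mp.RawArray`, which is the same factory as `mp.sharedctypes.RawArray` exposed at the package top level; the size and values are arbitrary.

```python
import ctypes
import multiprocessing as mp

import numpy as np

N = 5
raw = mp.RawArray(ctypes.c_double, N)       # shared memory, no lock attached
arr = np.frombuffer(raw, dtype=np.float64)  # NumPy view on the same buffer, no copy
arr[:] = np.arange(N)

# The write made through the NumPy view is visible through the ctypes array.
print(raw[3], arr.sum())
```

Because there is no lock, any coordination between writers is up to you.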
The Array object has a get_obj() method associated with it, which returns the ctypes array which presents a buffer interface. I think the following should work...
from multiprocessing import Process, Array
import scipy
import numpy
def f(a):
    a[0] = -a[0]
if __name__ == '__main__':
    # Create the array
    N = int(10)
    unshared_arr = scipy.rand(N)
    a = Array('d', unshared_arr)
    print "Originally, the first two elements of arr = %s"%(a[:2])
    # Create, start, and finish the child process
    p = Process(target=f, args=(a,))
    p.start()
    p.join()
    # Print out the changed values
    print "Now, the first two elements of arr = %s"%a[:2]
    b = numpy.frombuffer(a.get_obj())
    b[0] = 10.0
    print a[0]
When run, this prints out the first element of a now being 10.0, showing a and b are just two views into the same memory.
To keep this safe across processes, I believe you will have to use the acquire and release methods on the Array object, a, and its built-in lock to make sure it's all safely accessed (though I'm not an expert on the multiprocessing module).
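The acquire/release pairing can be written with the lock as a context manager. A minimal single-process sketch, assuming the default lock that `Array` creates; `negate_all` is a hypothetical helper.

```python
import multiprocessing as mp

import numpy as np

a = mp.Array('d', [1.0, 2.0, 3.0])


def negate_all(shared):
    """Hypothetical helper: negate every element while holding the array's lock."""
    with shared.get_lock():  # acquired on entry, released on exit
        view = np.frombuffer(shared.get_obj())  # view on the shared buffer, no copy
        view *= -1


negate_all(a)
print(a[:])
```

The same function can be the target of a child process; the lock then serializes the in-place update between processes.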
While the answers already given are good, there is a much easier solution to this problem provided two conditions are met:
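You are on a POSIX-compliant operating system where child processes are started with fork (e.g. Linux; note that on macOS the default start method has been spawn since Python 3.8); and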
Your child processes need read-only access to the shared array.
In this case you do not need to fiddle with explicitly making variables shared, as the child processes will be created using a fork. A forked child automatically shares the parent's memory space. In the context of Python multiprocessing, this means it shares all module-level variables; note that this does not hold for arguments that you explicitly pass to your child processes or to the functions you call on a multiprocessing.Pool or so.
A simple example:
import multiprocessing
import numpy as np
# will hold the (implicitly mem-shared) data
data_array = None
# child worker function
def job_handler(num):
    # built-in id() returns unique memory ID of a variable
    return id(data_array), np.sum(data_array)
def launch_jobs(data, num_jobs=5, num_worker=4):
    global data_array
    data_array = data
    pool = multiprocessing.Pool(num_worker)
    return pool.map(job_handler, range(num_jobs))
# create some random data and execute the child jobs
mem_ids, sumvals = zip(*launch_jobs(np.random.rand(10)))
# this will print 'True' on POSIX OS, since the data was shared
print(np.all(np.asarray(mem_ids) == id(data_array)))
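Because this trick depends on fork, it may help to request that start method explicitly instead of relying on the platform default. A minimal sketch under that assumption: `set_data` is a hypothetical helper, and `get_context("fork")` raises ValueError on platforms without fork, such as Windows.

```python
import multiprocessing as mp

import numpy as np

data_array = None  # module-level; forked workers inherit whatever is set here


def set_data(data):
    """Hypothetical helper: publish the array as a module-level global."""
    global data_array
    data_array = data


def job_handler(num):
    # Reads the inherited global; nothing large is passed as an argument.
    return float(np.sum(data_array)) + num


def launch_jobs(data, num_jobs=5, num_workers=2):
    set_data(data)               # must happen before the pool forks
    ctx = mp.get_context("fork")  # explicit start method
    with ctx.Pool(num_workers) as pool:
        return pool.map(job_handler, range(num_jobs))


if __name__ == "__main__":
    print(launch_jobs(np.ones(10)))
```

The global has to be assigned before the pool is created, since workers only see the parent's state as it was at fork time.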
I've written a small python module that uses POSIX shared memory to share numpy arrays between python interpreters. Maybe you will find it handy.
https://pypi.python.org/pypi/SharedArray
Here's how it works:
import numpy as np
import SharedArray as sa
# Create an array in shared memory
a = sa.create("test1", 10)
# Attach it as a different array. This can be done from another
# python interpreter as long as it runs on the same computer.
b = sa.attach("test1")
# See how they are actually sharing the same memory block
a[0] = 42
print(b[0])
# Destroying a does not affect b.
del a
print(b[0])
# See how "test1" is still present in shared memory even though we
# destroyed the array a.
sa.list()
# Now destroy the array "test1" from memory.
sa.delete("test1")
# The array b is not affected, but once you destroy it then the
# data are lost.
print(b[0])
You can use the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem
Here's your original code then, this time using shared memory that behaves like a NumPy array (note the additional last statement calling a NumPy sum() function):
from multiprocessing import Process
import sharedmem
import scipy
def f(a):
    a[0] = -a[0]
if __name__ == '__main__':
    # Create the array
    N = int(10)
    unshared_arr = scipy.rand(N)
    arr = sharedmem.empty(N)
    arr[:] = unshared_arr.copy()
    print "Originally, the first two elements of arr = %s"%(arr[:2])
    # Create, start, and finish the child process
    p = Process(target=f, args=(arr,))
    p.start()
    p.join()
    # Print out the changed values
    print "Now, the first two elements of arr = %s"%arr[:2]
    # Perform some NumPy operation
    print arr.sum()
With Python 3.8+ there is the multiprocessing.shared_memory module in the standard library:
# np_sharing.py
from multiprocessing import Process
from multiprocessing.managers import SharedMemoryManager
from multiprocessing.shared_memory import SharedMemory
from typing import Tuple
import numpy as np
def create_np_array_from_shared_mem(
    shared_mem: SharedMemory, shared_data_dtype: np.dtype, shared_data_shape: Tuple[int, ...]
) -> np.ndarray:
    arr = np.frombuffer(shared_mem.buf, dtype=shared_data_dtype)
    arr = arr.reshape(shared_data_shape)
    return arr
def child_process(
    shared_mem: SharedMemory, shared_data_dtype: np.dtype, shared_data_shape: Tuple[int, ...]
):
    """Logic to be executed by the child process"""
    arr = create_np_array_from_shared_mem(shared_mem, shared_data_dtype, shared_data_shape)
    arr[0, 0] = -arr[0, 0] # modify the array backed by shared memory
def main():
    """Logic to be executed by the parent process"""
    # Data to be shared:
    data_to_share = np.random.rand(10, 10)
    SHARED_DATA_DTYPE = data_to_share.dtype
    SHARED_DATA_SHAPE = data_to_share.shape
    SHARED_DATA_NBYTES = data_to_share.nbytes
    with SharedMemoryManager() as smm:
        shared_mem = smm.SharedMemory(size=SHARED_DATA_NBYTES)
        arr = create_np_array_from_shared_mem(shared_mem, SHARED_DATA_DTYPE, SHARED_DATA_SHAPE)
        arr[:] = data_to_share # load the data into shared memory
        print(f"The [0,0] element of arr is {arr[0,0]}") # before
        # Run child process:
        p = Process(target=child_process, args=(shared_mem, SHARED_DATA_DTYPE, SHARED_DATA_SHAPE))
        p.start()
        p.join()
        print(f"The [0,0] element of arr is {arr[0,0]}") # after
        del arr # delete np array so the shared memory can be deallocated
if __name__ == "__main__":
    main()
Running the script:
$ python3.10 np_sharing.py
The [0,0] element of arr is 0.262091705529628
The [0,0] element of arr is -0.262091705529628
Since the arrays in different processes share the same underlying memory buffer, the standard caveats regarding race conditions apply.
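One way to handle those race conditions is to guard each read-modify-write with a multiprocessing.Lock. The following is only a sketch, not part of the answer above: `add_one` and `run` are hypothetical names, and the workers attach to the block by name rather than receiving the SharedMemory object.

```python
from multiprocessing import Lock, Process
from multiprocessing.shared_memory import SharedMemory

import numpy as np


def add_one(shm_name, lock, repeats):
    """Increment a shared int64 counter `repeats` times under the lock."""
    shm = SharedMemory(name=shm_name)  # attach to the existing block by name
    counter = np.ndarray((1,), dtype=np.int64, buffer=shm.buf)
    for _ in range(repeats):
        with lock:  # serialize the read-modify-write
            counter[0] += 1
    del counter  # drop the view before closing the mapping
    shm.close()


def run(workers=2, repeats=1000):
    shm = SharedMemory(create=True, size=8)
    counter = np.ndarray((1,), dtype=np.int64, buffer=shm.buf)
    counter[0] = 0
    lock = Lock()
    procs = [Process(target=add_one, args=(shm.name, lock, repeats))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    total = int(counter[0])
    del counter
    shm.close()
    shm.unlink()  # free the block once nobody needs it
    return total


if __name__ == "__main__":
    print(run())
```

Without the lock, concurrent increments can overwrite each other and the final count may come out lower than workers * repeats.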
This code shows the structure of what I am trying to do.
import multiprocessing
from foo import really_expensive_to_compute_object
## Create a really complicated object that is *hard* to initialise.
T = really_expensive_to_compute_object(10)
def f(x):
    return T.cheap_calculation(x)
P = multiprocessing.Pool(processes=64)
results = P.map(f, range(1000000))
print results
The problem is that each process starts by spending a lot of time recalculating T instead of using the original T that was computed once. Is there a way to prevent this? T has a fast (deep) copy method, so can I get Python to use that instead of recalculating?
The multiprocessing documentation suggests:
"Explicitly pass resources to child processes"
So your code can be rewritten to something like this:
import multiprocessing
import time
import functools
class really_expensive_to_compute_object(object):
    def __init__(self, arg):
        print 'expensive creation'
        time.sleep(3)
    def cheap_calculation(self, x):
        return x * 2
def f(T, x):
    return T.cheap_calculation(x)
if __name__ == '__main__':
    ## Create a really complicated object that is *hard* to initialise.
    T = really_expensive_to_compute_object(10)
    ## helper, to pass expensive object to function
    f_helper = functools.partial(f, T)
    # i've reduced count for tests
    P = multiprocessing.Pool(processes=4)
    results = P.map(f_helper, range(100))
    print results
Why not have f take a T parameter instead of referencing the global, and do the copies yourself? Note that Pool.map accepts only a single iterable, so the (copy, argument) pairs are zipped together and unpacked with Pool.starmap (Python 3.3+):
import multiprocessing, copy
from foo import really_expensive_to_compute_object
## Create a really complicated object that is *hard* to initialise.
T = really_expensive_to_compute_object(10)
def f(t, x):
    return t.cheap_calculation(x)
P = multiprocessing.Pool(processes=64)
results = P.starmap(f, zip((copy.deepcopy(T) for _ in range(1000000)), range(1000000)))
print(results)
How do I parallelize a recursive function in Python?
My function looks like this:
def f(x, depth):
    if x==0:
        return ...
    else :
        return [x] + map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
    # Heavy compute, pure function
When trying to parallelize it with multiprocessing.Pool.map, Windows opens an infinite number of processes and hangs.
What's a good (preferably simple) way to parallelize it (for a single multicore machine)?
Here is the code that hangs:
from multiprocessing import Pool
pool = Pool(processes=4)
def f(x, depth):
    if x==0:
        return ...
    else :
        return [x] + pool.map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
    # Heavy compute, pure function
OK, sorry for the problems with this.
I'm going to answer a slightly different question where f() returns the sum of the values in the list. That is because it's not clear to me from your example what the return type of f() would be, and using an integer makes the code simple to understand.
This is complex because there are two different things happening in parallel:
the calculation of the expensive function in the pool
the recursive expansion of f()
I am very careful to only use the pool to calculate the expensive function. In that way we don't get an "explosion" of processes, but because this is asynchronous we need to postpone a lot of work for the callback that the worker calls once the expensive function is done.
More than that, we need to use a countdown latch so that we know when all the separate sub-calls to f() are complete.
There may be a simpler way (I am pretty sure there is, but I need to do other things), but perhaps this gives you an idea of what is possible:
from multiprocessing import Pool, Value, RawArray, RLock
from time import sleep
class Latch:
    '''A countdown latch that lets us wait for a job of "n" parts'''
    def __init__(self, n):
        self.__counter = Value('i', n)
        self.__lock = RLock()
    def decrement(self):
        with self.__lock:
            self.__counter.value -= 1
            print('dec', self.read())
            return self.read() == 0
    def read(self):
        with self.__lock:
            return self.__counter.value
    def join(self):
        while self.read():
            sleep(1)
def list_of_values(x):
    '''An expensive function'''
    print(x, ': thinking...')
    sleep(1)
    print(x, ': thought')
    return list(range(x))
pool = Pool()
def async_f(x, on_complete=None):
    '''Return the sum of the values in the expensive list'''
    if x == 0:
        on_complete(0) # no list, return 0
    else:
        n = x # need to know size of result beforehand
        latch = Latch(n) # wait for n entries to be calculated
        result = RawArray('i', n+1) # where we will assemble the map
        def delayed_map(values):
            '''This is the callback for the pool async process - it runs
            in a separate thread within this process once the
            expensive list has been calculated and orchestrates the
            mapping of f over the result.'''
            result[0] = x # first value in list is x
            # enumerate yields (index, value); here the two coincide because
            # list_of_values(x) returns list(range(x))
            for (i, v) in enumerate(values):
                def callback(fx, i=i):
                    '''This is the callback passed to f() and is called when
                    the function completes. If it is the last of all the
                    calls in the map then it calls on_complete() (ie another
                    instance of this function) for the calling f().'''
                    result[i+1] = fx
                    if latch.decrement(): # have completed list
                        # at this point result contains [x]+map(f, ...)
                        on_complete(sum(result)) # so return sum
                async_f(v, callback)
        # Ask worker to generate list then call delayed_map
        pool.apply_async(list_of_values, [x], callback=delayed_map)
def run():
    '''Tie into the same mechanism as above, for the final value.'''
    result = Value('i')
    latch = Latch(1)
    def final_callback(value):
        result.value = value
        latch.decrement()
    async_f(6, final_callback)
    latch.join() # wait for everything to complete
    return result.value
print(run())
PS: I am using Python 3.2 and the ugliness above is because we are delaying computation of the final results (going back up the tree) until later. It's possible something like generators or futures could simplify things.
Also, I suspect you need a cache to avoid needlessly recalculating the expensive function when called with the same argument as earlier.
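For the caching idea, a minimal sketch with functools.lru_cache might look like this. The body of list_of_values is a stand-in for the real expensive function, and the `calls` counter exists only to make the cache hit visible.

```python
from functools import lru_cache

calls = 0  # counts real evaluations, for illustration only


@lru_cache(maxsize=None)
def list_of_values(x):
    """Stand-in for the expensive pure function.

    Returns a tuple so that the cached value cannot be mutated by callers.
    """
    global calls
    calls += 1
    return tuple(range(x))


list_of_values(4)
list_of_values(4)  # second call is served from the cache
print(calls, list_of_values.cache_info().hits)
```

Keep in mind that the cache lives in whichever process calls the function, so pool workers each build their own cache and do not share entries.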
See also yaniv's answer - which seems to be an alternative way to reverse the order of the evaluation by being explicit about depth.
After thinking about this, I found a simple, not complete, but good enough answer:
# A partially parallel solution. Just do the first level of recursion in parallel. It might be enough work to fill all cores.
import multiprocessing
def f_helper(data):
    return f(x=data['x'],depth=data['depth'], recursion_depth=data['recursion_depth'])
def f(x, depth, recursion_depth):
    if depth==0:
        return ...
    else :
        if recursion_depth == 0:
            # only the top level of the recursion is farmed out to a pool
            pool = multiprocessing.Pool(processes=4)
            result = [x] + pool.map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } for _x in list_of_values(x)])
            pool.close()
        else:
            # deeper levels run serially inside the worker
            result = [x] + list(map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } for _x in list_of_values(x)]))
        return result
def list_of_values(x):
    # Heavy compute, pure function
    ...
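The same "parallelize only the first level" idea can be sketched in runnable form with concurrent.futures. Everything specific here is an assumption made for illustration: the base case returns [x] (the original leaves it elided), list_of_values is a cheap stand-in, and `f_parallel` is a hypothetical name.

```python
from concurrent.futures import ProcessPoolExecutor


def list_of_values(x):
    # Stand-in for the heavy pure function (assumption for this sketch).
    return list(range(x))


def f(x, depth):
    """Plain serial recursion; the base case [x] is assumed."""
    if depth == 0 or x == 0:
        return [x]
    return [x] + [f(v, depth - 1) for v in list_of_values(x)]


def f_parallel(x, depth, workers=4):
    """Run only the first level of the recursion in worker processes."""
    if depth == 0 or x == 0:
        return [x]
    children = list_of_values(x)
    with ProcessPoolExecutor(max_workers=workers) as ex:
        # Each worker evaluates one whole subtree serially with f.
        subtrees = list(ex.map(f, children, [depth - 1] * len(children)))
    return [x] + subtrees


if __name__ == "__main__":
    print(f_parallel(3, 2) == f(3, 2))
```

Since the workers never create pools themselves, the number of processes stays bounded by `workers`.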
I store the main process id initially and transfer it to sub programs.
When I need to start a multiprocessing job, I check the number of children of the main process. If it is less than or equal to half of my CPU count, I run the job in parallel. If it is greater than half of my CPU count, I run it serially. This avoids bottlenecks and uses the CPU cores effectively. You can tune the number of cores for your case. For example, you can set it to the exact number of CPU cores, but you should not exceed it.
import multiprocessing
import psutil
# Note: MyPool, subProgram, input_params and main_process_id are not defined in this snippet.
def subProgramhWrapper(func, args):
    func(*args)
parent = psutil.Process(main_process_id)
children = parent.children(recursive=True)
num_cores = int(multiprocessing.cpu_count()/2)
if num_cores >= len(children):
    #parallel run
    pool = MyPool(num_cores)
    results = pool.starmap(subProgram, input_params)
    pool.close()
    pool.join()
else:
    #serial run
    for input_param in input_params:
        subProgramhWrapper(subProgram, input_param)