Python core usage slower/under 100% with multiprocessing.Pool

Code that runs on one core at 100% actually runs slower when multiprocessed, where it runs on several cores at ~50%.
This question is asked frequently, and the best threads I've found about it (0, 1) give the answer, "It's because the workload isn't heavy enough, so the inter-process communication (IPC) overhead ends up making things slower."
I don't know whether or not this is right, but I've isolated an example where this happens AND doesn't happen for the same workload, and I want to know whether this answer still applies or why it actually happens:
from multiprocessing import Pool
def f(n):
    res = 0
    for i in range(n):
        res += i**2
    return res
def single(n):
    """ Single core """
    for i in range(n):
        f(n)
def multi(n):
    """ Multi core """
    pool = Pool(2)
    for i in range(n):
        pool.apply_async(f, (n,))
    pool.close()
    pool.join()
def single_r(n):
    """ Single core, returns """
    res = 0
    for i in range(n):
        res = f(n) % 1000 # Prevent overflow
    return res
def multi_r(n):
    """ Multi core, returns """
    pool = Pool(2)
    res = 0
    for i in range(n):
        res = pool.apply_async(f, (n,)).get() % 1000
    pool.close()
    pool.join()
    return res
# Run
n = 5000
if __name__ == "__main__":
    print(f"single({n})...", end='')
    single(n)
    print(" DONE")
    print(f"multi({n})...", end='')
    multi(n)
    print(" DONE")
    print(f"single_r({n})...", end='')
    single_r(n)
    print(" DONE")
    print(f"multi_r({n})...", end='')
    multi_r(n)
    print(" DONE")
The workload is f().
f() is run single-cored and dual-cored without return calls via single() and multi().
Then f() is run single-cored and dual-cored with return calls via single_r() and multi_r().
My result is that slowdown happens when f() is run multiprocessed with return calls. Without returns, it doesn't happen.
So single() takes q seconds. multi() is much faster. Good. Then single_r() takes q seconds. But then multi_r() takes much more than q seconds. Visual inspection of my system monitor corroborates this (a little hard to tell, but the multi(n) hump is shaded two colors, indicating activity from two different cores).
Also, corroborating video of the terminal outputs
Even with uniform workload, is this still IPC overhead? Is such overhead only paid when other processes return their results, and, if so, is there a way to avoid it while still returning results?

As Darkonaut pointed out, the slowdown when using multiple processes in multi_r() is because the get() call is blocking:
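for i in range(n):
    res = pool.apply_async(f, (n,)).get() % 1000
This effectively runs the workload one task at a time: each iteration submits a task and then blocks in get() until its result arrives, so only one task is ever in flight. The tasks may alternate between the two worker processes (more akin to multithreading, and consistent with the ~50% usage per core), but they never run simultaneously, and the multiprocess overhead is added on top, making it run slower than the single-cored equivalent single_r()!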
Meanwhile, multi() ran faster (i.e., ran in parallel correctly) because it contains no get() calls.
To run parallel and return results, collect result objects first as in:
def multi_r_collected(n):
    """ Multi core, collects apply_async() results before returning them """
    pool = Pool(2)
    res = 0
    res = [pool.apply_async(f, (n,)) for i in range(n)] # Collect first!
    pool.close()
    pool.join()
    res = [r.get() % 1000 for r in res] # .get() after!
    return res
Visual inspection of CPU activity corroborates the noticed speed-up; when run with 12 processes via Pool(12), there's a clean, uniform mesa of multiple cores clearly running at 100% in parallel (not the 50% mishmash of multi_r(n)).
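As an alternative to collecting the AsyncResult objects by hand, Pool.map() queues every task before it waits, and returns the results in submission order. The following is only a minimal sketch that reuses f() from the question; the name multi_r_map and the small n in the demo are illustrative assumptions, not part of the original code:

```python
from multiprocessing import Pool

def f(n):
    """Workload from the question: sum of the squares below n."""
    res = 0
    for i in range(n):
        res += i**2
    return res

def multi_r_map(n, processes=2):
    """Hypothetical variant of multi_r(): all tasks are queued before any waiting happens."""
    with Pool(processes) as pool:
        # map() blocks until every result is in, but the tasks themselves run in parallel
        return [r % 1000 for r in pool.map(f, [n] * n)]

if __name__ == "__main__":
    # Small n so the demo finishes quickly
    print(multi_r_map(200)[:3])
```

Like multi_r_collected(), this keeps both workers busy because nothing calls get() between submissions.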

Python-MultiThreading: Can MultiThreading improve "for loop" performance?

As far as I understand, multithreading is an ideal option for I/O-bound applications.
Therefore, I tested a "for loop" without any I/O (code below).
However, threading still reduced the execution time from 6.3s to 3.7s.
Is this result correct, or is there a mistake in my assumption?
from multiprocessing.dummy import Pool as ThreadPool
import time
# normal
l = []
s = time.time()
for i in range(0, 10000):
    for j in range(i):
        l.append(j * 10)
e = time.time()
print(f"case1: {e-s}") # 6.3 sec
# multiThread
def func(x):
    for i in range(x):
        l_.append(i * 10)
with ThreadPool(50) as pool:
    l_ = []
    s = time.time()
    pool.map(func, range(0, 10000))
    e = time.time()
    print(f"case2: {e-s}") # 3.7 sec
What you are seeing is just Python specializing the function with faster opcodes in the multithreaded version, because it is a function that is called many times. See PEP 659 (Specializing Adaptive Interpreter); this is only true for Python 3.11 and later.
Changing the non-multithreaded version to also be a function that is called many times gives almost the same performance (Python 3.11):
from multiprocessing.dummy import Pool as ThreadPool
import time
l = []
def f2(i):
    for j in range(i):
        l.append(j * 10)
def f1():
    # normal
    for i in range(0, 10_000):
        f2(i)
s = time.time()
f1()
e = time.time()
print(f"case1: {e-s}")
# multiThread
def func(x):
    global l_
    for i in range(x):
        l_.append(i * 10)
with ThreadPool(50) as pool:
    l_ = []
    s = time.time()
    pool.map(func, range(0, 10_000))
    e = time.time()
    print(f"case2: {e-s}")
case1: 3.9744303226470947
case2: 4.036579370498657
Threading in Python IS going to be slower for functions that need the GIL, and manipulating lists requires the GIL, so using threading will be slower. Python threads only improve performance if the GIL is released (which happens on I/O and in calls to external C libraries). If you ever measure otherwise, then either your code releases the GIL or your benchmark is flawed.
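To illustrate the other side of that claim, here is a rough sketch of a case where threads do help, because the work releases the GIL while it waits. time.sleep() stands in for a real I/O call, and the helper names (fake_io, run_serial, run_threaded, timed) are made up for this example:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for an I/O call: time.sleep() releases the GIL while waiting."""
    time.sleep(delay)
    return delay

def run_serial(delays):
    """Wait for each simulated request one after another."""
    return [fake_io(d) for d in delays]

def run_threaded(delays):
    """Give each simulated request its own thread so the waits overlap."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(fake_io, delays))

def timed(func, *args):
    """Return (result, elapsed_seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    delays = [0.2] * 4
    _, serial_s = timed(run_serial, delays)
    _, threaded_s = timed(run_threaded, delays)
    print(f"serial: {serial_s:.2f}s, threaded: {threaded_s:.2f}s")
```

The threaded run should take roughly as long as the longest single wait, whereas a pure-Python loop such as the list appends above gets no such benefit.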
In general it is true that multithreading is better suited for I/O bound operations. However, in this trivial case it is clearly not so.
It's worth pointing out multiprocessing will outperform either of the strategies implemented in OP's code.
Here's a rewrite that demonstrates 3 techniques:
from concurrent.futures import ProcessPoolExecutor as PPE
from concurrent.futures import ThreadPoolExecutor as TPE
from functools import wraps
from time import perf_counter
def timeit(func):
    @wraps(func)
    def _wrapper(*args, **kwargs):
        start = perf_counter()
        result = func(*args, **kwargs)
        duration = perf_counter() - start
        print(f'Function {func.__name__}{args} {kwargs} Took {duration:.4f} seconds')
        return result
    return _wrapper
@timeit
def case1():
    l = []
    for i in range(0, 10000):
        for j in range(i):
            l.append(j * 10)
def process(n):
    l = []
    for j in range(n):
        l.append(j * 10)
@timeit
def case2():
    with TPE() as tpe:
        tpe.map(process, range(0, 10_000))
@timeit
def case3():
    with PPE() as ppe:
        ppe.map(process, range(0, 10_000))
if __name__ == '__main__':
    for func in case1, case2, case3:
        func()
Output:
Function case1() {} Took 3.3104 seconds
Function case2() {} Took 2.6354 seconds
Function case3() {} Took 1.7245 seconds
In this case the trivial processing is probably outweighed by the overheads in thread management. If case1 was even more CPU intensive you'd probably begin to see rather different results
Multi threading is ideal for I/O applications because it allows a server/host to accept multiple connections, and if a single request is slow or hangs, it can continue serving the other connections without blocking them.
That isn't mutually exclusive from speeding up a simple for loop execution, if there's no coordination between threads required like in your trivial example above. If each execution of loop is completely independent of any other executions, then it's also very well suited to multi threading, and that's why you're seeing a speed up.

Multiprocessing: Shared memory slower than pickling?

I am trying to get acquainted with multiprocessing in Python. Performance is not what I expected; therefore, I am seeking advice on how to make things work more efficiently.
Let me first state my objective: I basically have a bunch of lists of data. Each of these lists can be processed independently, say by some dummy routine do_work. My implementation in my actual program is slow (slower than doing the same in a single process serially). I was wondering if this is due to the pickling/unpickling overhead involved in multiprocess programming.
Therefore, I tried to implement a version using shared memory. Since the way I distribute the work makes sure that no two processes try to write to the same piece of memory at the same time, I use multiprocessing.RawArray and RawValue. As it turns out, the version with shared memory is even slower.
My code is as follows: main_pass and worker_pass implement the parallelisation using return-statements, while main_shared and worker_shared use shared memory.
import multiprocessing, time, timeit, numpy as np
data = None
def setup():
    return np.random.randint(0,100, (1000,100000)).tolist(), list(range(1000))
def do_work(input):
    output = []
    for j in input:
        if j % 3 == 0:
            output.append(j)
    return output
def main_pass():
    global data
    data, instances = setup()
    with multiprocessing.Pool(4) as pool:
        start = time.time()
        new_blocks = pool.map(worker_pass, instances)
        print("done", time.time() - start)
def worker_pass(i):
    global data
    return do_work(data[i])
def main_shared():
    global data
    data, instances = setup()
    data = [(a := multiprocessing.RawArray('i', block), multiprocessing.RawValue('i', len(a))) for block in data]
    with multiprocessing.Pool(4) as pool:
        start = time.time()
        pool.map(worker_shared, instances)
        print("done", time.time() - start)
        new_blocks = [list(a[:l.value]) for a, l in data]
        print(new_blocks)
def worker_shared(i):
    global data
    array, length = data[i]
    new_block = do_work(array[:length.value])
    array[:len(new_block)] = new_block
    length.value = len(new_block)
import timeit
if __name__ == '__main__':
    multiprocessing.set_start_method('fork')
    print(timeit.timeit(lambda: main_pass(), number=1))
    print(timeit.timeit(lambda: main_shared(), number=1))
the timing I get:
done 7.257717132568359
10.633161254
done 7.889772891998291
38.037218965
So the version run first (using return) is way faster than the one writing the result to shared memory.
Why is this?
Btw., is it possible to measure the time spent on pickling/unpickling conveniently?
Info: I am using python 3.9 on MacOS 10.15.
What you say about the output returned from worker_pass being pickled is true, but that additional overhead evidently does not outweigh the additional work done by worker_shared to "repack" the RawArray instances. Where shared memory does achieve a performance improvement is when you are forced to use pickling for the worker_pass case, as on platforms that use spawn to create new processes.
In the following spawn demo I seed the random number generator with a specific value so I get the same generated values for both runs and I print out the sum of all the returned random numbers just to ensure that both runs are doing equivalent processing. It is clear that using shared memory arrays performs better now if you are only timing the pool-creation (where the overhead is for the non-shared memory case) and map times. But when you include the additional setup time and post-processing time required for the use of the shared memory arrays, the difference in times is not that significant:
import multiprocessing, time, timeit, numpy as np
def setup():
    np.random.seed(seed=1)
    return np.random.randint(0,100, (1000,100000)).tolist(), list(range(1000))
def init_process_pool(the_data):
    global data
    data = the_data
def do_work(input):
    output = []
    for j in input:
        if j % 3 == 0:
            output.append(j)
    return output
def main_pass():
    data, instances = setup()
    start = time.time()
    with multiprocessing.Pool(4, initializer=init_process_pool, initargs=(data,)) as pool:
        new_blocks = pool.map(worker_pass, instances)
        print("done", time.time() - start)
    print(sum(sum(new_block) for new_block in new_blocks))
def worker_pass(i):
    global data
    return do_work(data[i])
def main_shared():
    data, instances = setup()
    data = [(a := multiprocessing.RawArray('i', block), multiprocessing.RawValue('i', len(a))) for block in data]
    start = time.time()
    with multiprocessing.Pool(4, initializer=init_process_pool, initargs=(data,)) as pool:
        pool.map(worker_shared, instances)
        print("done", time.time() - start)
        new_blocks = [list(a[:l.value]) for a, l in data]
        #print(new_blocks)
        print(sum(sum(new_block) for new_block in new_blocks))
def worker_shared(i):
    global data
    array, length = data[i]
    new_block = do_work(array[:length.value])
    array[:len(new_block)] = new_block
    length.value = len(new_block)
import timeit
if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    print(timeit.timeit(lambda: main_pass(), number=1))
    print(timeit.timeit(lambda: main_shared(), number=1))
Prints:
done 17.68915629386902
1682969169
20.2827687
done 3.9250364303588867
1682969169
23.2993996
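The question also asked whether the time spent on pickling/unpickling can be measured conveniently. One rough approach is to time pickle.dumps() and pickle.loads() directly on a representative payload. This only approximates what multiprocessing does internally (it ignores pipe transfer and scheduling), and the helper name time_pickle_roundtrip is an assumption made for this sketch:

```python
import pickle
import time

def time_pickle_roundtrip(obj, repeats=5):
    """Return (best_dumps_seconds, best_loads_seconds, payload_bytes) for obj."""
    dumps_times, loads_times = [], []
    payload = b""
    for _ in range(repeats):
        start = time.perf_counter()
        payload = pickle.dumps(obj)        # serialisation cost
        dumps_times.append(time.perf_counter() - start)
        start = time.perf_counter()
        pickle.loads(payload)              # deserialisation cost
        loads_times.append(time.perf_counter() - start)
    return min(dumps_times), min(loads_times), len(payload)

if __name__ == "__main__":
    # One block of the same size as a row in the question's data
    block = list(range(100_000))
    dumps_s, loads_s, size = time_pickle_roundtrip(block)
    print(f"dumps: {dumps_s:.4f}s, loads: {loads_s:.4f}s, size: {size} bytes")
```

Multiplying the per-block figures by the number of blocks gives a ballpark for the serialisation share of the total runtime.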

Eliminating overhead in multiprocessing with pool

I am currently in a situation where I have parallelized code called repeatedly and try to reduce the overhead associated with the multiprocessing. So, consider the following example, which deliberately contains no "expensive" computations:
import multiprocessing as mp
def f(x):
    # toy function
    return x*x
if __name__ == '__main__':
    for x in range(500):
        pool = mp.Pool(processes=2)
        print(pool.map(f, range(x, x + 50)))
        pool.close()
        pool.join() # necessary?
This code takes 53 seconds compared to 0.04 seconds for the sequential approach.
First question: do I really need to call pool.join() in this case when only pool.map() is ever used? I cannot find any negative effects from omitting it and the runtime would drop to 4.8 seconds. (I understand that omitting pool.close() is not possible, as we would be leaking threads then.)
Now, while this would be a nice improvement, as a first answer I would probably get "well, don't create the pool in the loop in the first place". Ok, no problem, but the parallelized code actually lives in an instance method, so I would use:
class MyObject:
    def __init__(self):
        self.pool = mp.Pool(processes=2)
    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))
if __name__ == '__main__':
    my_object = MyObject()
    for x in range(500):
        my_object.function(x)
This would be my favorite solution as it runs in excellent 0.4 seconds.
Second question: should I call pool.close()/pool.join() somewhere explicitly (e.g. in the destructor of MyObject) or is the current code sufficient? (If it matters: it is ok to assume there are only a few long-lived instances of MyObject in my project.)
Of course it takes a long time: you keep allocating a new pool and destroying it for every x.
It will run much faster if instead you do:
if __name__ == '__main__':
    pool = mp.Pool(processes=2) # allocate the pool only once
    for x in range(500):
        print(pool.map(f, range(x, x + 50)))
    pool.close() # close it only after all the requests are submitted
    pool.join() # wait for the last worker to finish
Try that and you'll see it now works much faster.
Per the docs for close and join: once close is called you can't submit more tasks to the pool, and join waits until the last worker has finished its job. They should be called in that order (first close, then join).
Well, actually you could pass already allocated pool as argument to your object:
class MyObject:
    def __init__(self, pool):
        self.pool = pool
    def function(self, x):
        print(self.pool.map(f, range(x, x + 50)))
if __name__ == '__main__':
    with mp.Pool(2) as pool:
        my_object = MyObject(pool)
        my_second_object = MyObject(pool)
        for x in range(500):
            my_object.function(x)
            my_second_object.function(x)
        pool.close()
I cannot find a reason why it might be necessary to use different pools in different objects.
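For the second question (where to call close()/join() when the object owns its pool), one option is to give the object an explicit close() method plus context-manager support, rather than relying on a destructor. This is only a sketch: the class name PooledObject and the pool_factory parameter are hypothetical additions, and ThreadPool is imported purely so a thread-backed pool with the same API can be injected for quick experiments:

```python
import multiprocessing as mp
from multiprocessing.pool import ThreadPool  # same API as Pool, backed by threads

def f(x):
    # toy function from the question
    return x * x

class PooledObject:
    """Hypothetical variant of MyObject that owns its pool and releases it explicitly."""
    def __init__(self, pool_factory=mp.Pool, processes=2):
        self.pool = pool_factory(processes)

    def function(self, x):
        return self.pool.map(f, range(x, x + 50))

    def close(self):
        self.pool.close()  # no further tasks are accepted
        self.pool.join()   # wait for the workers to exit

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

if __name__ == "__main__":
    with PooledObject() as obj:
        for x in range(3):
            print(obj.function(x)[:5])
```

After close() the pool rejects new work with a ValueError, so a long-lived instance should be closed only once its last request has been submitted.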

How to parallel sum a loop using multiprocessing in Python

I am having difficulty understanding how to use Python's multiprocessing module.
I have a sum from 1 to n where n=10^10, which is too large to fit into a list, which seems to be the thrust of many examples online using multiprocessing.
Is there a way to "split up" the range into segments of a certain size and then perform the sum for each segment?
For instance
def sum_nums(low,high):
    result = 0
    for i in range(low,high+1):
        result += i
    return result
And I want to compute sum_nums(1,10**10) by breaking it up into many sum_nums(1,1000) + sum_nums(1001,2000) + sum_nums(2001,3000)... and so on. I know there is a closed form n(n+1)/2, but pretend we don't know that.
Here is what I've tried
import multiprocessing
def sum_nums(low,high):
    result = 0
    for i in range(low,high+1):
        result += i
    return result
if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs
    jobs = []
    for i in range(0, procs):
        process = multiprocessing.Process(target=sum_nums, args=(i*sizeSegment+1, (i+1)*sizeSegment))
        jobs.append(process)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    #where is the result?
I find the usage of multiprocessing.Pool and map() much simpler.
Using your code:
from multiprocessing import Pool
def sum_nums(args):
    low = int(args[0])
    high = int(args[1])
    return sum(range(low,high+1))
if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs
    # Create size segments list
    jobs = []
    for i in range(0, procs):
        jobs.append((i*sizeSegment+1, (i+1)*sizeSegment))
    pool = Pool(procs).map(sum_nums, jobs)
    result = sum(pool)
>>> print(result)
500500
You can do this sum without multiprocessing at all, and it's probably simpler, if not faster, to just use generators.
# prepare a generator of generators each at 1000 point intervals
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> list(xr)[:3]
[xrange(1, 1001), xrange(1001, 2001), xrange(2001, 3001)]
# sum, using two map functions
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> sum(map(sum, map(lambda x:x, xr)))
50000000005000000000L
However, if you want to use multiprocessing, you can also do this too. I'm using a fork of multiprocessing that is better at serialization (but otherwise, not really different).
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> import pathos
>>> mmap = pathos.multiprocessing.ProcessingPool().map
>>> tmap = pathos.multiprocessing.ThreadingPool().map
>>> sum(tmap(sum, mmap(lambda x:x, xr)))
50000000005000000000L
The version w/o multiprocessing is faster and takes about a minute on my laptop. The multiprocessing version takes a few minutes due to the overhead of spawning multiple python processes.
If you are interested, get pathos here: https://github.com/uqfoundation
First, the best way to get around the memory issue is to use an iterator/generator instead of a list:
def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result
In Python 3, range() is already lazy (it does not build a list), so xrange is only needed in Python 2.
Now, where multiprocessing comes in is when you want to split up the processing to different processes or CPU cores. If you don't need to control the individual workers then the easiest method is to use a process pool. This will let you map a function to the pool and get the output. You can alternatively use apply_async to apply jobs to the pool one at a time and get a delayed result which you can get with .get():
import multiprocessing
from multiprocessing import Pool
from time import time
def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result
# map requires a function to handle a single argument
def sn((low,high)):
    return sum_nums(low, high)
if __name__ == '__main__':
    #t = time()
    # takes forever
    #print sum_nums(1,10**10)
    #print '{} s'.format(time() -t)
    p = Pool(4)
    n = int(1e8)
    r = range(0,10**10+1,n)
    results = []
    # using apply_async
    t = time()
    for arg in zip([x+1 for x in r],r[1:]):
        results.append(p.apply_async(sum_nums, arg))
    # wait for results
    print sum(res.get() for res in results)
    print '{} s'.format(time() -t)
    # using process pool
    t = time()
    print sum(p.map(sn, zip([x+1 for x in r], r[1:])))
    print '{} s'.format(time() -t)
On my machine, just calling sum_nums with 10**10 takes almost 9 minutes, but using a Pool(8) and n=int(1e8) reduces this to just over a minute.
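The answers above are written for Python 2. A rough Python 3 sketch of the same idea (split the range into inclusive segments, sum each one in a worker, then add the partial sums) could look like the following. The helper names segments and parallel_sum are assumptions for illustration, and the demo uses a small n so it finishes quickly:

```python
from multiprocessing import Pool

def segments(low, high, size):
    """Yield inclusive (start, end) pairs that cover low..high in chunks of `size`."""
    start = low
    while start <= high:
        end = min(start + size - 1, high)
        yield start, end
        start = end + 1

def sum_nums(low, high):
    """Sum the integers low..high inclusive (explicit loop kept, as in the question)."""
    result = 0
    for i in range(low, high + 1):
        result += i
    return result

def parallel_sum(low, high, size, processes=2):
    """Sum low..high by handing one segment at a time to a worker process."""
    with Pool(processes) as pool:
        # starmap unpacks each (start, end) pair into sum_nums(start, end)
        return sum(pool.starmap(sum_nums, segments(low, high, size)))

if __name__ == "__main__":
    n = 10**6
    total = parallel_sum(1, n, 10**5)
    assert total == n * (n + 1) // 2  # check against the closed form
    print(total)
```

Larger segments mean fewer tasks and less inter-process traffic; for n = 10^10 a segment size near the 1e8 used above keeps the task list short.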

Parallel recursive function in Python

How do I parallelize a recursive function in Python?
My function looks like this:
def f(x, depth):
    if x==0:
        return ...
    else :
        return [x] + map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
    # Heavy compute, pure function
When trying to parallelize it with multiprocessing.Pool.map, Windows opens an infinite number of processes and hangs.
What's a good (preferably simple) way to parallelize it (for a single multicore machine)?
Here is the code that hangs:
from multiprocessing import Pool
pool = Pool(processes=4)
def f(x, depth):
    if x==0:
        return ...
    else :
        return [x] + pool.map(lambda x:f(x, depth-1), list_of_values(x))
def list_of_values(x):
    # Heavy compute, pure function
OK, sorry for the problems with this.
I'm going to answer a slightly different question where f() returns the sum of the values in the list. That is because it's not clear to me from your example what the return type of f() would be, and using an integer makes the code simple to understand.
This is complex because there are two different things happening in parallel:
the calculation of the expensive function in the pool
the recursive expansion of f()
I am very careful to only use the pool to calculate the expensive function. In that way we don't get an "explosion" of processes, but because this is asynchronous we need to postpone a lot of work for the callback that the worker calls once the expensive function is done.
More than that, we need to use a countdown latch so that we know when all the separate sub-calls to f() are complete.
There may be a simpler way (I am pretty sure there is, but I need to do other things), but perhaps this gives you an idea of what is possible:
from multiprocessing import Pool, Value, RawArray, RLock
from time import sleep
class Latch:
    '''A countdown latch that lets us wait for a job of "n" parts'''
    def __init__(self, n):
        self.__counter = Value('i', n)
        self.__lock = RLock()
    def decrement(self):
        with self.__lock:
            self.__counter.value -= 1
            print('dec', self.read())
            return self.read() == 0
    def read(self):
        with self.__lock:
            return self.__counter.value
    def join(self):
        while self.read():
            sleep(1)
def list_of_values(x):
    '''An expensive function'''
    print(x, ': thinking...')
    sleep(1)
    print(x, ': thought')
    return list(range(x))
pool = Pool()
def async_f(x, on_complete=None):
    '''Return the sum of the values in the expensive list'''
    if x == 0:
        on_complete(0) # no list, return 0
    else:
        n = x # need to know size of result beforehand
        latch = Latch(n) # wait for n entries to be calculated
        result = RawArray('i', n+1) # where we will assemble the map
        def delayed_map(values):
            '''This is the callback for the pool async process - it runs
            in a separate thread within this process once the
            expensive list has been calculated and orchestrates the
            mapping of f over the result.'''
            result[0] = x # first value in list is x
            for (i, v) in enumerate(values):
                def callback(fx, i=i):
                    '''This is the callback passed to f() and is called when
                    the function completes. If it is the last of all the
                    calls in the map then it calls on_complete() (ie another
                    instance of this function) for the calling f().'''
                    result[i+1] = fx
                    if latch.decrement(): # have completed list
                        # at this point result contains [x]+map(f, ...)
                        on_complete(sum(result)) # so return sum
                async_f(v, callback)
        # Ask worker to generate list then call delayed_map
        pool.apply_async(list_of_values, [x], callback=delayed_map)
def run():
    '''Tie into the same mechanism as above, for the final value.'''
    result = Value('i')
    latch = Latch(1)
    def final_callback(value):
        result.value = value
        latch.decrement()
    async_f(6, final_callback)
    latch.join() # wait for everything to complete
    return result.value
print(run())
PS: I am using Python 3.2 and the ugliness above is because we are delaying computation of the final results (going back up the tree) until later. It's possible something like generators or futures could simplify things.
Also, I suspect you need a cache to avoid needlessly recalculating the expensive function when called with the same argument as earlier.
See also yaniv's answer - which seems to be an alternative way to reverse the order of the evaluation by being explicit about depth.
After thinking about this, I found a simple, not complete, but good enough answer:
# A partially parallel solution. Just do the first level of recursion in parallel. It might be enough work to fill all cores.
import multiprocessing
def f_helper(data):
    return f(x=data['x'],depth=data['depth'], recursion_depth=data['recursion_depth'])
def f(x, depth, recursion_depth):
    if depth==0:
        return ...
    else :
        if recursion_depth == 0:
            pool = multiprocessing.Pool(processes=4)
            result = [x] + pool.map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } for _x in list_of_values(x)])
            pool.close()
        else:
            result = [x] + list(map(f_helper, [{'x':_x, 'depth':depth-1, 'recursion_depth':recursion_depth+1 } for _x in list_of_values(x)]))
        return result
def list_of_values(x):
    # Heavy compute, pure function
I store the main process id initially and pass it to the sub-programs.
When I need to start a multiprocessing job, I check the number of children of the main process. If it is less than or equal to half of my CPU count, I run the job in parallel. If it is greater than half of my CPU count, I run it serially. This avoids bottlenecks and uses the CPU cores effectively. You can tune the number of cores for your case. For example, you can set it to the exact number of CPU cores, but you should not exceed it.
def subProgramhWrapper(func, args):
    func(*args)
parent = psutil.Process(main_process_id)
children = parent.children(recursive=True)
num_cores = int(multiprocessing.cpu_count()/2)
if num_cores >= len(children):
    #parallel run
    pool = MyPool(num_cores)
    results = pool.starmap(subProgram, input_params)
    pool.close()
    pool.join()
else:
    #serial run
    for input_param in input_params:
        subProgramhWrapper(subProgram, input_param)
