Python multiprocessing pool execution time compared to non-multiprocessing execution time

I am currently redesigning a program to use Python's multiprocessing pools. My first impression was that the execution time increased instead of decreased. Therefore, I got curious and wrote a little test script:
import time
import multiprocessing

def simple(x):
    return 2*x

def less_simple(x):
    b = x
    for i in range(0, 100):
        b = b * i
    return 2*x

a = list(range(0, 1000000))

print("without multiprocessing:")
before = time.time()
res = map(simple, a)
after = time.time()
print(str(after - before))

print("-----")
print("with multiprocessing:")
for i in range(1, 5):
    before = time.time()
    with multiprocessing.Pool(processes=i) as pool:
        pool.map(simple, a)
    after = time.time()
    print(str(i) + " processes: " + str(after - before))
I get the following results:
without multiprocessing:
2.384185791015625e-06
with multiprocessing:
1 processes: 0.35068225860595703
2 processes: 0.21297240257263184
3 processes: 0.21887946128845215
4 processes: 0.3474385738372803
When I replace simple with less_simple in the two map calls, I get the following results:
without multiprocessing:
2.6226043701171875e-06
with multiprocessing:
1 processes: 3.1453816890716553
2 processes: 1.615351676940918
3 processes: 1.6125438213348389
4 processes: 1.5159809589385986
Honestly, I am a bit confused because the non-multiprocessing version is always several orders of magnitude faster. Additionally, increasing the number of processes seems to have little to no influence on the runtime. Therefore, I have a few questions:
Am I making a mistake in the usage of multiprocessing?
Are my test functions too simple to get a positive impact from multiprocessing?
Is there a way to estimate at which point multiprocessing has an advantage, or do I have to test it?

I did some more research and basically, you are right. Both functions are rather small and somewhat artificial. However, there is a measurable time difference between non-multiprocessing and multiprocessing even for those functions, when you take into consideration how map works. The map function only returns an iterator yielding the results [1], i.e., in the above example, it only creates the iterator which is of course very fast.
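For illustration (not part of the original timings), forcing the iterator makes this laziness obvious: the map call itself returns almost instantly, and the real work only happens when the result is consumed.
res = map(simple, a)        # builds a lazy map object; essentially instantaneous
res = list(map(simple, a))  # consuming it with list() actually calls simple for every element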
Therefore, I replaced the map function with a traditional for loop:
for elem in a:
    res = simple(elem)
For the simple function, the execution is still faster without multiprocessing because the overhead is too big for such a small function:
without multiprocessing:
0.1392803192138672
with multiprocessing:
1 processes: 0.38080787658691406
2 processes: 0.22507309913635254
3 processes: 0.21307945251464844
4 processes: 0.2152390480041504
However, in the case of the function less_simple, you can see an actual advantage of multiprocessing:
without multiprocessing:
3.2029929161071777
with multiprocessing:
1 processes: 3.4934208393096924
2 processes: 1.8259460926055908
3 processes: 1.9196875095367432
4 processes: 1.716357946395874
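A side note, not part of the original measurements: Pool.map already picks a chunksize automatically, but when the per-item work is as tiny as in simple, tuning it explicitly is one knob that can reduce the inter-process communication overhead. A minimal sketch, with an arbitrary example value:
with multiprocessing.Pool(processes=4) as pool:
    # chunksize controls how many items are shipped to a worker per task;
    # 10_000 is only an illustrative value, not a measured optimum
    res = pool.map(simple, a, chunksize=10_000)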
[1] https://docs.python.org/3/library/functions.html#map

Related

Can one efficiently thread short CPU tasks in python?

I am trying to streamline a program that involves a set of short tasks that can be done in parallel, where the results of the set of tasks must be compared before moving on to the next step (which again involves a set of short tasks, and then another set, etc.). Because these tasks are so short, the set-up time of multiprocessing makes it not worthwhile. I am wondering if there is another way to do these short tasks in parallel that is faster than linear. The only question I can find on this site that describes this problem for Python references this answer on memory sharing, which I don't think answers my question (or if it does, I could not follow how).
To illustrate what I am hoping to do, consider the problem of summing a bunch of numbers from 0 to N. (Of course this can be solved analytically, my point is to come up with a low-memory but short CPU-intensive task.) First, the linear approach would simply be:
def numbers(a, b):
    return (i for i in range(a, b))

def linear_sum(a):
    return sum(numbers(a[0], a[1]))

n = 2000
linear_sum([0, n+1])
# 2001000
For threading, I want to break the problem into parts that can then be summed separately and then combined, so the idea would be to get a bunch of ranges over which to sum with something like
import numpy as np

def get_ranges(i, Nprocess=3):
    di = i // Nprocess
    j = np.append(np.arange(0, i, di), [i+1,])
    return [(j[k], j[k+1]) for k in range(len(j)-1)]
and for some value n >> Nprocess the pseudocode would be something like
values = get_ranges(n)
x = []
for value in values:
    x.append(do_something_parallel(value))
return sum(x)
The question, then, is how to implement do_something_parallel. For multiprocessing, we can do something like:
from multiprocessing import Pool as ThreadPool

def mpc_thread_sum(i, Nprocess=3):
    values = get_ranges(i)
    pool = ThreadPool(Nprocess)
    results = pool.map(linear_sum, values)
    pool.close()
    pool.join()
    return sum(results)

print(mpc_thread_sum(2000))
# 2001000
A graph of the performance of the different approaches described (not reproduced here) shows a region where multiprocessing is still slower than linear. Is there a way to speed up computations in that region, or is this the limit of parallelization under Python's GIL? I suspect the answer may be that I am hitting my limit, but I wanted to ask here to be sure. I tried multiprocessing.dummy, asyncio, threading, and ThreadPoolExecutor (from concurrent.futures). For brevity, I have omitted the code, but all show execution times comparable to the linear approach. All are designed for I/O-bound tasks, so they are constrained by the GIL.
My first observation is that the running time of function numbers can be cut roughly in half by simply defining it as:
def numbers(a, b):
    return range(a, b)
Second, a task that is 100% CPU-intensive, such as computing the sum of numbers, can never perform significantly better with multithreading in pure Python without the aid of a C-language runtime library (such as numpy), because of contention for the Global Interpreter Lock (GIL), which prevents any sort of thread-level parallelism from occurring (and asyncio only uses a single thread to begin with).
Third, the only way you can achieve a performance improvement running pure Python code against a 100% CPU-bound task is with multiprocessing. But there is CPU overhead in creating the process pool, overhead in passing arguments from the main process to the address space in which the process pool's workers run, and overhead again in passing back the results. So for there to be any performance improvement, the worker function, linear_sum, cannot be trivial; it must require enough CPU processing to warrant the additional overhead I just mentioned.
The following benchmark runs the worker function, renamed to compute_sum, which now accepts a range as its argument. To further reduce overhead, I have introduced a function split that takes the passed range argument and generates multiple range instances, removing the need to use numpy and to generate arrays. The benchmark computes the sum using a single thread (linear), a multithreading pool, and a multiprocessing pool, and is run twice, for n = 2000 and n = 50_000_000. The benchmark displays the elapsed time and the total CPU time across all processes.
For n = 2000, multiprocessing, as expected, performs worse than both linear and multithreading. For n = 50_000_000, multiprocessing's total CPU time is a bit higher than for linear and multithreading, as expected, due to the additional overhead mentioned above. But now the elapsed time has gone down significantly. For both values of n, multithreading is a loser.
from multiprocessing.pool import Pool, ThreadPool
import time

def split(iterable, n):
    k, m = divmod(len(iterable), n)
    return (iterable[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def compute_sum(r):
    t = time.process_time()
    return (sum(r), time.process_time() - t)

if __name__ == '__main__':
    for n in (2000, 50_000_000):
        r = range(0, n+1)

        t1 = time.time()
        s, cpu = compute_sum(r)
        elapsed = time.time() - t1
        print(f'n = {n}, linear elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')

        t1 = time.time()
        t2 = time.process_time()
        thread_pool = ThreadPool(4)
        s = 0
        for return_value, process_time in thread_pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
        elapsed = time.time() - t1
        cpu = time.process_time() - t2
        print(f'n = {n}, thread pool elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        thread_pool.close()
        thread_pool.join()

        t1 = time.time()
        t2 = time.process_time()
        pool = Pool(4)
        s = 0
        cpu = 0
        for return_value, process_time in pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
            cpu += process_time
        elapsed = time.time() - t1
        cpu += time.process_time() - t2
        print(f'n = {n}, multiprocessing elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        pool.close()
        pool.join()
        print()
Prints:
n = 2000, linear elapsed time = 0.0, total cpu time = 0.0, sum = 2001000
n = 2000, thread pool elapsed time = 0.00700068473815918, total cpu time = 0.015625, sum = 2001000
n = 2000, multiprocessing elapsed time = 0.13200139999389648, total cpu time = 0.015625, sum = 2001000
n = 50000000, linear elapsed time = 2.0311124324798584, total cpu time = 2.03125, sum = 1250000025000000
n = 50000000, thread pool elapsed time = 2.050999164581299, total cpu time = 2.046875, sum = 1250000025000000
n = 50000000, multiprocessing elapsed time = 0.7579991817474365, total cpu time = 2.359375, sum = 1250000025000000

Add new tasks when the number of active jobs becomes less than N

I am trying to parallelize some very time-consuming tasks on a certain number of cores. My goal is to exploit the server's resources as much as possible. The total number of CPUs is 20, but the number of tasks to do is much larger (say, 100).
Each task runs for a different amount of time, so the code below leaves some cores idle without work.
import multiprocessing as mp

def some_task(*args):
    # Something happening here
    pass

cpu_count = mp.cpu_count()
p = mp.Pool(cpu_count)
n_thread = 1
for something in somethings:
    p.apply_async(some_task, args=(something, ))
    if n_thread == cpu_count:
        p.close()
        p.join()
        p = mp.Pool(cpu_count)
        n_thread = 1
        continue
    n_thread += 1
p.close()
if n_thread != 1:
    p.join()
How to program the code that will start new tasks when the number of active jobs becomes less than the number of CPUs?
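Not part of the original post, but for context: a single Pool already hands a new task to a worker as soon as one becomes free, so the usual pattern is to submit all tasks to one pool up front instead of recreating the pool in batches. A minimal sketch, assuming some_task and somethings roughly as above (the placeholder bodies are hypothetical):
import multiprocessing as mp

def some_task(something):
    # placeholder for the real work
    return something

if __name__ == '__main__':
    somethings = range(100)               # hypothetical list of 100 tasks
    with mp.Pool(mp.cpu_count()) as p:    # 20 workers on the server described above
        async_results = [p.apply_async(some_task, args=(s,)) for s in somethings]
        results = [r.get() for r in async_results]  # blocks until every task has finished
With this structure, the pool keeps all workers busy until the task list is exhausted; no manual bookkeeping of active jobs is needed.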

Can I speed up performance by applying functions to an item in a data object with multiprocessing?

Disclaimer: I have gone through loads of multiprocessing answers on SO and also the documentation; either the questions were really old (Python 3.x has made tons of improvements since) or I did not find a clear answer. If I have missed something relevant, do point me in the right direction.
I started with a simple function that I defined as below in my folder module, since I am running in a Jupyter Notebook and it seems that, due to conflicts, you can only run multiprocessing on an imported function:
def f(a):
    return a * 100
I built some test data and ran some tests:
from itertools import zip_longest
from multiprocessing import Process, Pool, Array, Queue
from time import time
from modules.test import *
li = [i for i in range(1000000)]
List comprehension: Really Fast
start = time()
tests = [f(i) for i in li]
print(f'Total time {time() - start} s')
>> Total time 0.154066801071167 s
Approach from an SO answer here: 11 seconds or so
start = time()
results = []
if __name__ == '__main__':
    jobs = 4
    size = len(li)
    heads = list(range(size//jobs, size, size//jobs)) + [size]
    tails = range(0, size, size//jobs)
    pool = Pool(4)
    for tail, head in zip(tails, heads):
        r = pool.apply_async(f, args=(li[tail:head],))
        results.append(r)
    pool.close()
    pool.join()  # wait for the pool to be done
print(f'Total time {time() - start} s')
>> Total time 11.087551593780518 s
And there is Process, which I do not know whether it is applicable to the example above. I am unfamiliar with multiprocessing, but I do understand that there is some overhead in creating new processes and whatnot; as the data grows larger, it should justify the overhead.
My question is: with the current performance of Python 3.x, is using multiprocessing for operations similar to the above still relevant, or is it something one should even attempt? And if it is, how can it be applied to parallelize the workload?
Most of the examples I have read and understood are used for web scraping, where there is actual idle time while one process waits to receive information, so it makes sense to parallelize; but how would you approach it if you are running computations over something like a list or a dictionary?
The reason your example is not performing well is because you are doing two totally different things.
In your list comprehension, you are mapping f onto each element of li.
In the second case, you are splitting your li list into jobs chunks and then applying your function jobs times, once to each of those chunks. And now, in f, a * 100 takes a chunk about a quarter the size of your original list and multiplies it by 100, i.e., it uses the sequence-repetition operator, so it creates a new list 100 times the size of the chunk:
>>> chunk = [1,2,3]
>>> chunk * 10
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>>
So basically, you are comparing apples to oranges.
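If you did want to keep the manual chunking, the worker would have to map f over its chunk rather than multiplying the chunk itself; a hypothetical chunk-level helper (not in the original post) might look like this:
def f_chunk(chunk):
    # apply f to each element of the chunk instead of repeating the list
    return [f(x) for x in chunk]

# ...and then submit chunks with pool.apply_async(f_chunk, args=(li[tail:head],))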
However, multiprocessing already comes with an out-of-the-box mapping utility. Here is a better comparison, a script called foo.py:
import time
import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
Now, here's some actual performance results:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 0.14193987846374512 seconds
Pool.map took 0.2513458728790283 seconds
(py37) Juans-MBP:test_mp juan$
For this very trivial function, the cost of the inter-process communication will always be higher than the cost of calculating the function serially. So you won't see any gains from multiprocessing. However, a much less trivial function can see gains from multiprocessing.
Here's a trivial example: I simply sleep for a microsecond before multiplying:
import time
import multiprocessing as mp

def f(x):
    time.sleep(0.000001)
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
And now, you see gains commensurate with the number of processes:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 13.175776720046997 seconds
Pool.map took 3.1484851837158203 seconds
Note, on my machine, a single multiplication takes orders of magnitude less time than a microsecond (about 10 nanoseconds):
>>> import timeit
>>> timeit.timeit('100*3', number=int(1e6))*1e-6
1.1292944999993892e-08

firing a sequence of parallel tasks

For this dask code:
def inc(x):
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)
I want to access all the elements in array by executing the delayed tasks. But I can't call array.compute(), since array is a plain list rather than a dask object. If I do
for x in range(5):
    array[x].compute()
then does each task get executed in parallel, or does array[1] get fired only after array[0] terminates? Is there a better way to write this code?
You can use the dask.compute function to compute many delayed values at once
from dask import delayed, compute
array = [delayed(inc)(i) for i in range(5)]
result = compute(*array)
It's easy to tell if things are executing in parallel if you force them to take a long time. If you run this code:
from time import sleep, time
from dask import delayed

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %s]' % (x, time() - start))
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)

for x in range(5):
    array[x].compute()
It becomes very obvious that the calls happen in sequence. However, if you replace the last loop with this:
delayed(array).compute()
then you can see that they are in parallel. On my machine the output looks like this:
[inc(1): 1.00373506546]
[inc(4): 1.00429320335]
[inc(2): 1.00471806526]
[inc(3): 1.00475406647]
[inc(0): 2.00795912743]
Clearly the first four tasks were executed in parallel. Presumably the default parallelism is set to the number of cores on the machine, because for CPU-intensive tasks it's generally not useful to have more.
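If you want to control that parallelism explicitly, recent dask versions let you choose the scheduler and worker count per compute call. A sketch, reusing inc from above; the keyword arguments assume a reasonably current dask release:
from dask import compute, delayed

array = [delayed(inc)(i) for i in range(5)]
# run on the threaded scheduler with an explicit number of workers
results = compute(*array, scheduler='threads', num_workers=2)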

multiprocessing full capacity in Python

I wrote the following code, which calls the function compute_cluster 6 times in parallel (each run of this function is independent of the others, and each run writes its results to a separate file):
global L
for L in range(6, 24):
    pool = Pool(6)
    pool.map(compute_cluster, range(1, 3))
    pool.close()

if __name__ == "__main__":
    main(sys.argv)
Despite the fact that I'm running this code on an i7 machine, and no matter how large I set the Pool, it always runs only two processes in parallel. Is there any suggestion on how I can run 6 processes in parallel, such that the first three processes use L=6 and call compute_cluster with parameter values from 1:3 in parallel, while at the same time the other three processes run the same function with the same parameter values but with the global L value of 7?
Any suggestions are highly appreciated.
There are a few things wrong here. First, as to why you always only have 2 processes going at a time: the reason is that range(1, 3) only returns 2 values, so you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case, the code probably works, but it's limiting your performance, since it is the factor preventing you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like this¹:
def wrapper(tup):
    l, r = tup
    # Even better would be to get rid of `L` and pass it to compute_cluster
    global L
    L = l
    compute_cluster(r)

for r in range(1, 3):
    p = Pool(6)
    p.map(wrapper, [(l, r) for l in range(6, 24)])
    p.close()
This works with the global L because each spawned process picks up its own copy of L -- It doesn't get shared between processes.
¹ Untested code
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
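And, as the comment inside wrapper already hints, the cleanest variant (a sketch, assuming compute_cluster can be changed to accept L as a parameter) avoids the global entirely:
def wrapper(tup):
    l, r = tup
    compute_cluster(r, l)   # hypothetical signature: compute_cluster(r, L)

p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
p.join()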
