I want to do something in parallel, but it always ends up slower. Below are two code snippets that can be compared: the multiprocessing version takes 12 seconds on my laptop, while the sequential version takes only 3 seconds. I thought multiprocessing was supposed to be faster.
I know the task itself makes no sense; it exists only to compare the two approaches, and I know bubble sort could be replaced by a faster algorithm.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    # Pick sample_size random indices and return the corresponding elements
    sorted_list = sorted(random.sample(range(len(myset)), sample_size))
    return [myset[i] for i in sorted_list]

def bubbleSort(iterator, alist, return_dictionary):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list) - 1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > sample_list[i + 1]:
                sample_list[i], sample_list[i + 1] = sample_list[i + 1], sample_list[i]
    return_dictionary[iterator] = sample_list

if __name__ == '__main__':
    manager = Manager()
    return_dictionary = manager.dict()
    jobs = []
    for i in range(3000):
        p = Process(target=bubbleSort, args=(i, myArray, return_dictionary))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(return_dictionary.values())
The other way:
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    # Pick sample_size random indices and return the corresponding elements
    sorted_list = sorted(random.sample(range(len(myset)), sample_size))
    return [myset[i] for i in sorted_list]

def bubbleSort(alist):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list) - 1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > sample_list[i + 1]:
                sample_list[i], sample_list[i + 1] = sample_list[i + 1], sample_list[i]
    return sample_list

if __name__ == '__main__':
    results = []
    for i in range(3000):
        results.append(bubbleSort(myArray))
    print(results)
Multiprocessing is faster if you have multiple cores and parallelize properly. In your example you create 3000 processes, which causes an enormous amount of context switching between them. Instead, use a Pool to schedule the work across a fixed number of processes:
from multiprocessing import Pool

def bubbleSort(alist):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list) - 1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > sample_list[i + 1]:
                sample_list[i], sample_list[i + 1] = sample_list[i + 1], sample_list[i]
    return sample_list

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
        pass
I removed all the output and ran some tests on my 4-core machine. As expected, the code above was about four times faster than your sequential example.
Multiprocessing is not just magically faster. Your computer still has to do the same total amount of work; if you yourself try to do several tasks at once, each one does not get done any quicker.
In a "normal" program, the sequential version is easier to read and write (that it is also this much faster surprises me a little). Multiprocessing is especially useful when you have to wait for something, like a web request (you can send several at once and don't have to wait for each one in turn), or when you are running some sort of event loop.
My guess as to why the sequential version wins here is bookkeeping overhead: with multiple processes (or threads) Python has to keep track of what is where, which costs extra time (don't quote me on the exact details).
To put it back into a real-world analogy: if you hand a task to somebody else and, instead of waiting for them, you do other things at the same time, then together you finish sooner.
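To illustrate the "waiting" case, here is a rough sketch (the fake_request function and the one-second sleep are made-up stand-ins for a slow external call such as a web request); with four workers the waits overlap instead of adding up:

import time
from multiprocessing import Pool

def fake_request(i):
    # Stand-in for a slow external call, e.g. a web request
    time.sleep(1)
    return i

if __name__ == '__main__':
    start = time.time()
    results = [fake_request(i) for i in range(4)]   # waits 4 x 1 second
    print("sequential:", time.time() - start)

    start = time.time()
    with Pool(4) as pool:
        results = pool.map(fake_request, range(4))  # all four waits overlap
    print("parallel:  ", time.time() - start)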
Related
I am trying to measure the advantage of the Pool class in the multiprocessing module over plain sequential code, using a function that calculates the square of a number. Computing the squares of all the numbers with the pool takes about 0.24 s, but doing it in an ordinary for loop takes only about 0.007 s. Why is that? Shouldn't the version using the pool be faster?
import time
from multiprocessing import Pool, Process

def f(x):
    return x * x

if __name__ == '__main__':
    start = time.time()
    array = []
    for i in range(1000000):
        array.append(i)
    with Pool(4) as p:
        p.map(f, array)
    print(time.time() - start)   # time taken when using the pool

    start1 = time.time()
    for i in range(1000000):
        f(array[i])
    print(time.time() - start1)  # time taken normally
As suggested by @Klaus D. and @wwii, my function does not do enough computation to overcome the overhead of spawning processes and switching between them.
Below is updated code that makes the difference visible. Hope it helps.
import time
import random
from multiprocessing import Pool, Process

def f(x):
    time.sleep(3)

if __name__ == '__main__':
    array = []
    for i in range(4):
        array.append(i)

    start = time.time()
    with Pool(4) as p:
        p.map(f, array)
    print(time.time() - start)   # time taken when using the pool

    start1 = time.time()
    for i in range(4):
        f(array[i])
    print(time.time() - start1)  # time taken normally
The problem is that the function you hand to the pool workers is too cheap for parallelism to pay off.
Try this instead:
import time
from multiprocessing import Pool, Process

N = 80
M = 1_000_000

def f_std(array):
    """
    Calculate the standard deviation of array
    """
    mean = sum(array) / len(array)
    std = ((sum(map(lambda x: (x - mean) ** 2, array))) / len(array)) ** .5
    return std

if __name__ == '__main__':
    array = []
    for i in range(N):
        array.append(range(M))

    start = time.time()
    with Pool(8) as p:
        p.map(f_std, array)
    print(time.time() - start)   # time taken when using the pool

    start1 = time.time()
    for i in range(N):
        f_std(array[i])
    print(time.time() - start1)  # time taken normally
A manager.dict() allows sharing a dictionary across processes and performing thread-safe operations on it.
In my case, a coordinator process creates the shared dict with m elements, and n worker processes each read from and write to a single dict key.
Does manager.dict() have one single lock for the whole dict, or m locks, one for every key in it?
Is there an alternative way to share m elements with n workers, other than a shared dict, when the workers do not have to communicate with each other?
Related: python-manager-dict-is-very-slow-compared-to-regular-dict
After some experiments I can say there is only one lock per manager dict.
Here is the code that demonstrates it:
import time
import multiprocessing as mp

def process_f(key, shared_dict):
    values = [i for i in range(64 * 1024 * 1024)]
    print("Writing {}...".format(key))
    a = time.time()
    shared_dict[key] = values
    b = time.time()
    print("released {} in {}ms".format(key, (b - a) * 1000))

def main():
    process_manager = mp.Manager()
    n = 5
    keys = [i for i in range(n)]
    shared_dict = process_manager.dict({i: i * i for i in keys})
    pool = mp.Pool(processes=n)
    for i in range(n):
        pool.apply_async(process_f, (keys[i], shared_dict))
    time.sleep(20)

if __name__ == '__main__':
    main()
output:
Writing 4...
Writing 3...
Writing 1...
Writing 2...
Writing 0...
released 4 in 3542.7968502ms
released 0 in 4416.22900963ms
released 1 in 6247.48706818ms
released 2 in 7926.97191238ms
released 3 in 9973.71196747ms
Process finished with exit code 0
The steadily increasing write times show the workers queuing up and waiting for the single lock.
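As for the alternative: if the workers never need to see each other's data, a shared dict is not required at all. Here is a rough sketch (process_f and the list size are simplified stand-ins for the code above, not the original workload): let each worker return its result and have the coordinator assemble the dict from the return values, so nothing is locked while the workers run.

import multiprocessing as mp

def process_f(key):
    # Build the result locally; nothing shared is touched while working
    values = [i * i for i in range(1024)]
    return key, values

def main():
    n = 5
    with mp.Pool(processes=n) as pool:
        # Each worker sends its result back exactly once, so there is
        # no shared lock for the workers to queue up on.
        results = dict(pool.map(process_f, range(n)))
    print(sorted(results.keys()))

if __name__ == '__main__':
    main()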
I understand that there is overhead when using the multiprocessing module, but this seems like a lot, and the amount of IPC should be fairly low as far as I can tell.
Say I generate a large-ish list of random numbers between 1 and 1000 and want to obtain a list of only the prime numbers. This code is only meant to test multiprocessing on CPU-intensive tasks, so ignore the overall inefficiency of the primality test.
The bulk of the code may look something like this:
from random import SystemRandom
from math import sqrt
from timeit import default_timer as time
from multiprocessing import Pool, Process, Manager, cpu_count

rdev = SystemRandom()
NUM_CNT = 0x5000
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
primes = []

def chunk(l, n):
    # Split list l into n roughly equal parts
    i = len(l) // n
    for j in range(0, n - 1):
        yield l[j * i:j * i + i]
    yield l[n * i - i:]

def is_prime(n):
    if n < 2: return False
    if n == 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
It seems to me that I should be able to split this work among multiple processes. I have 8 logical cores, so I should be able to use cpu_count() as the number of processes.
Serial:
def serial():
    global primes
    primes = []
    for num in nums:
        if is_prime(num):
            primes.append(num)  # primes now contains all the values
The following values of NUM_CNT correspond to these timings:
0x500 = 0.00100 sec.
0x5000 = 0.01723 sec.
0x50000 = 0.27573 sec.
0x500000 = 4.31746 sec.
This is the way I chose to do the multiprocessing. It uses the chunk() function to split nums into cpu_count() roughly equal parts, passes each chunk to a new process that iterates through it, and then assigns the result to an entry of a shared dict. The IPC should really only occur when I assign the value to the shared variable. Why would it occur otherwise?
def loop(ret, id, numbers):
    l_primes = []
    for num in numbers:
        if is_prime(num):
            l_primes.append(num)
    ret[id] = l_primes

def parallel():
    man = Manager()
    ret = man.dict()
    num_procs = cpu_count()
    procs = []
    for i, l in enumerate(chunk(nums, num_procs)):
        p = Process(target=loop, args=(ret, i, l))
        p.daemon = True
        p.start()
        procs.append(p)
    [proc.join() for proc in procs]
    return sum(ret.values(), [])
Again, I expect some overhead, but the time seems to grow far faster than in the serial version.
0x500 = 0.37199 sec.
0x5000 = 0.91906 sec.
0x50000 = 8.38845 sec.
0x500000 = 119.37617 sec.
What is causing this? Is it IPC? The initial setup makes me expect some overhead, but this is just an insane amount.
Edit:
Here's how I'm timing the execution of the functions:
if __name__ == '__main__':
    print(hex(NUM_CNT))
    for func in (serial, parallel):
        t1 = time()
        vals = func()
        t2 = time()
        if vals is None:  # serial has no return value
            print(len(primes))
        else:             # but parallel does
            print(len(vals))
        print("Took {:.05f} sec.".format(t2 - t1))
The same list of numbers is used each time.
Example output:
0x5000
3442
Took 0.01828 sec.
3442
Took 0.93016 sec.
Hmm. How do you measure time? On my computer, the parallel version is much faster than the serial one.
I'm measuring with time.time(); assume tt is an alias for time.time():

t1 = int(round(tt() * 1000))
serial()
t2 = int(round(tt() * 1000))
print(t2 - t1)
parallel()
t3 = int(round(tt() * 1000))
print(t3 - t2)
I get, with 0x500000 as input:
5519ms for the serial version
3351ms for the parallel version
I believe your mistake is that the random number generation is included in the parallel measurement but not in the serial one.
On my computer, generating the random numbers takes about 45 seconds (SystemRandom is a very slow source), so that alone can explain the difference between your two values, as I don't think my computer has a very different architecture.
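One way to check that hypothesis is a minimal sketch that times only the generation step (it reuses the names rdev and NUM_CNT from the question's code and changes nothing about the algorithm; 0x500000 matches the input size above):

from random import SystemRandom
from timeit import default_timer as time

rdev = SystemRandom()
NUM_CNT = 0x500000

t0 = time()
# Time nothing but the random number generation
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
print("generating {:#x} numbers took {:.3f} sec.".format(NUM_CNT, time() - t0))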
For this dask code:
from dask import delayed

def inc(x):
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)
I want to access all the elements of array by executing the delayed tasks. But I can't call array.compute(), since array is a plain Python list, not a delayed object. If I do

for x in range(5):
    array[x].compute()

does each task get executed in parallel, or does array[1] only start after array[0] finishes? Is there a better way to write this code?
You can use the dask.compute function to compute many delayed values at once
from dask import delayed, compute
array = [delayed(inc)(i) for i in range(5)]
result = compute(*array)
It's easy to tell if things are executing in parallel if you force them to take a long time. If you run this code:
from time import sleep, time
from dask import delayed

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %s]' % (x, time() - start))
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)

for x in range(5):
    array[x].compute()
It becomes very obvious that the calls happen in sequence. However, if you replace the last loop with this:
delayed(array).compute()
then you can see that they are in parallel. On my machine the output looks like this:
[inc(1): 1.00373506546]
[inc(4): 1.00429320335]
[inc(2): 1.00471806526]
[inc(3): 1.00475406647]
[inc(0): 2.00795912743]
Clearly the first four tasks executed in parallel. Presumably the default parallelism is set to the number of cores on the machine, since for CPU-intensive tasks it's not generally useful to have more.
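If you want to check or control that worker count, newer dask versions accept a num_workers argument to compute; here is a sketch under that assumption (the exact keyword and default scheduler may differ between dask versions):

from time import sleep, time
from dask import delayed, compute

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %.2f]' % (x, time() - start))
    return x + 1

tasks = [delayed(inc)(x) for x in range(5)]

# Limit the default threaded scheduler to two workers; five one-second
# tasks should then take roughly three seconds instead of one to two.
results = compute(*tasks, num_workers=2)
print(results)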
I have been trying to optimise my code using the multiprocessing module, but I think I have fallen into the trap of premature optimization.
For example, when running this code:
import multiprocessing as mp
from collections import Counter

num = 1000000
l = mp.Manager().list()
for i in range(num):
    l.append(i)
l_ = Counter(l)
It takes several times longer than this:
from collections import Counter

num = 1000000
l = []
for i in range(num):
    l.append(i)
l_ = Counter(l)
What is the reason the multiprocessing list is slower than regular lists? And are there ways to make them as efficient?
Shared data structures are meant to be shared between processes, so accesses have to be synchronized: every operation on a manager list goes through the manager process and takes a lock. A plain list ([]) needs neither the round trip nor the lock.
With versus without that locking and IPC makes a big difference.
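To make that concrete, here is a rough sketch (using 100000 elements instead of the question's million, just to keep the run short): building the data locally and handing it to the manager list in a single extend call replaces one locked round trip per element with a single transfer.

import time
import multiprocessing as mp

if __name__ == '__main__':
    num = 100000
    manager = mp.Manager()

    # One locked round trip to the manager process per append: slow.
    shared = manager.list()
    t0 = time.time()
    for i in range(num):
        shared.append(i)
    print("per-item append:          ", time.time() - t0)

    # Build the list locally, then transfer it in a single call.
    shared2 = manager.list()
    t0 = time.time()
    local = list(range(num))
    shared2.extend(local)
    print("local build + one extend: ", time.time() - t0)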