I wrote the following code, which calls the function compute_cluster 6 times in parallel (each run of the function is independent of the others, and each run writes its results to a separate file):
global L
for L in range(6, 24):
    pool = Pool(6)
    pool.map(compute_cluster, range(1, 3))
    pool.close()

if __name__ == "__main__":
    main(sys.argv)
Despite the fact that I'm running this code on an i7 machine, and no matter how large I set the Pool, it only ever runs two processes in parallel. So is there any suggestion on how I can run 6 processes in parallel, such that the first three processes use L=6 and call compute_cluster with parameter values from 1:3 in parallel, and at the same time the other three processes run the same function with the same parameter values but with the global L value set to 7?
Any suggestion is highly appreciated.
There are a few things wrong here. First, as to why you only ever have 2 processes going at a time: range(1, 3) only returns 2 values, so you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case the code probably works, but it limits your performance, since the global is what keeps you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like this1:
def wrapper(tup):
    l, r = tup
    # Even better would be to get rid of `L` and pass it to compute_cluster
    global L
    L = l
    compute_cluster(r)

for r in range(1, 3):
    p = Pool(6)
    p.map(wrapper, [(l, r) for l in range(6, 24)])
    p.close()
This works with the global L because each spawned process picks up its own copy of L -- It doesn't get shared between processes.
1Untested code
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
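If you can change compute_cluster itself (as the comment in wrapper hints), passing L in as an argument removes the global entirely. A minimal sketch, assuming compute_cluster can be rewritten to accept the (L, r) pair directly (compute_cluster_lr here is a hypothetical renamed variant):
from multiprocessing import Pool

def compute_cluster_lr(args):
    # Hypothetical variant of compute_cluster that takes L explicitly
    # instead of reading it from a global
    l, r = args
    # ... do the real work for this (L, r) pair and write its output file ...
    return l, r

if __name__ == "__main__":
    p = Pool(6)
    # 18 values of L times 2 values of r = 36 independent tasks,
    # at most 6 running at any one time
    p.map(compute_cluster_lr, [(l, r) for l in range(6, 24) for r in range(1, 3)])
    p.close()
    p.join()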
Related
I am currently redesigning a program to use Python's multiprocessing pools. My first impression was that the execution time increased instead of decreased. Therefore, I got curious and wrote a little test script:
import time
import multiprocessing

def simple(x):
    return 2*x

def less_simple(x):
    b = x
    for i in range(0, 100):
        b = b * i
    return 2*x

a = list(range(0, 1000000))

print("without multiprocessing:")
before = time.time()
res = map(simple, a)
after = time.time()
print(str(after - before))

print("-----")

print("with multiprocessing:")
for i in range(1, 5):
    before = time.time()
    with multiprocessing.Pool(processes=i) as pool:
        pool.map(simple, a)
    after = time.time()
    print(str(i) + " processes: " + str(after - before))
I get the following results:
without multiprocessing:
2.384185791015625e-06
with multiprocessing:
1 processes: 0.35068225860595703
2 processes: 0.21297240257263184
3 processes: 0.21887946128845215
4 processes: 0.3474385738372803
When I replace simple with less_simple in both map calls, I get the following results:
without multiprocessing:
2.6226043701171875e-06
with multiprocessing:
1 processes: 3.1453816890716553
2 processes: 1.615351676940918
3 processes: 1.6125438213348389
4 processes: 1.5159809589385986
Honestly, I am a bit confused, because the non-multiprocessing version is always several orders of magnitude faster. Additionally, increasing the number of processes seems to have little to no influence on the runtime. Therefore, I have a few questions:
Am I making some mistake in my usage of multiprocessing?
Are my test functions too simple to get a positive impact from multiprocessing?
Is there a way to estimate at which point multiprocessing has an advantage, or do I have to test it?
I did some more research and basically, you are right: both functions are rather small and somewhat artificial. However, there is still a measurable difference between the non-multiprocessing and multiprocessing versions even for these functions, once you take into account how map works. In Python 3, map only returns an iterator that yields the results lazily [1], i.e., in the example above the "without multiprocessing" run only creates the iterator, which is of course very fast.
Therefore, I replaced the map function with a traditional for loop:
for elem in a:
    res = simple(elem)
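Alternatively, you could keep map and force it to run by wrapping it in list(), which also makes the comparison with pool.map fairer, since pool.map returns a fully built list. In the script above that would look like:
before = time.time()
res = list(map(simple, a))  # list() consumes the iterator, so simple actually runs on every element
after = time.time()
print(str(after - before))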
For the simple function, the execution is still faster without multiprocessing because the overhead is too big for such a small function:
without multiprocessing:
0.1392803192138672
with multiprocessing:
1 processes: 0.38080787658691406
2 processes: 0.22507309913635254
3 processes: 0.21307945251464844
4 processes: 0.2152390480041504
However, in case of the function less_simple, you can see an actual advantage of multiprocessing:
without multiprocessing:
3.2029929161071777
with multiprocessing:
1 processes: 3.4934208393096924
2 processes: 1.8259460926055908
3 processes: 1.9196875095367432
4 processes: 1.716357946395874
[1] https://docs.python.org/3/library/functions.html#map
For this dask code:
def inc(x):
    return x + 1

for x in range(5):
    array[x] = delayed(inc)(x)
I want to access all the elements in array by executing the delayed tasks. But I can't call array.compute(), since array is a plain list, not a delayed object. If I do
for x in range(5):
    array[x].compute()
then does each task get executed in parallel, or does array[1] only get fired after array[0] terminates? Is there a better way to write this code?
You can use the dask.compute function to compute many delayed values at once
from dask import delayed, compute
array = [delayed(inc)(i) for i in range(5)]
result = compute(*array)
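Note that compute returns the results as a tuple, one entry per delayed value, so result here is (1, 2, 3, 4, 5).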
It's easy to tell if things are executing in parallel if you force them to take a long time. If you run this code:
from time import sleep, time
from dask import delayed

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %s]' % (x, time() - start))
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)

for x in range(5):
    array[x].compute()
It becomes very obvious that the calls happen in sequence. However if you replace the last loop with this:
delayed(array).compute()
then you can see that they are in parallel. On my machine the output looks like this:
[inc(1): 1.00373506546]
[inc(4): 1.00429320335]
[inc(2): 1.00471806526]
[inc(3): 1.00475406647]
[inc(0): 2.00795912743]
Clearly the first four tasks executed in parallel. Presumably the default parallelism is set to the number of cores on the machine, because for CPU-intensive tasks it's not generally useful to have more.
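If you want to pick the number of workers yourself rather than rely on that default, newer dask versions let you pass scheduler options to compute. A small sketch, assuming a dask version that accepts the scheduler and num_workers keywords:
from dask import compute, delayed

array = [delayed(inc)(i) for i in range(5)]
# Run everything on the threaded scheduler with an explicit worker count
results = compute(*array, scheduler='threads', num_workers=4)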
I want to do something in parallel, but it always ends up slower. Below are two code snippets that can be compared. The multiprocessing way needs 12 seconds on my laptop, the sequential way only 3 seconds. I thought multiprocessing was supposed to be faster.
I know that the task itself does not make much sense; it is only there to compare the two approaches. I also know that bubble sort can be replaced by faster sorting methods.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(iterator, alist, return_dictionary):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return_dictionary[iterator] = sample_list

if __name__ == '__main__':
    manager = Manager()
    return_dictionary = manager.dict()
    jobs = []
    for i in range(3000):
        p = Process(target=bubbleSort, args=(i, myArray, return_dictionary))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print return_dictionary.values()
The other way:
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    results = []
    for i in range(3000):
        results.append(bubbleSort(myArray))
    print results
Multiprocessing is faster if you have multiple cores and do the parallelization properly. In your example you create 3000 processes, which causes an enormous amount of context switching between them. Instead, use a Pool to schedule the jobs across a fixed number of worker processes:
from multiprocessing import Pool

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
        pass
I removed all the output and did some tests on my 4 core machine. As expected the code above was about 4 times faster than your sequential example.
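One more knob that often helps when the individual tasks are small is the chunksize argument of imap_unordered, which hands work to the workers in batches and cuts down on inter-process communication. The value 50 below is just an arbitrary starting point to tune:
if __name__ == '__main__':
    pool = Pool(processes=4)
    # Each worker receives 50 tasks at a time instead of one
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000)), chunksize=50):
        pass
    pool.close()
    pool.join()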
Multiprocessing is not just magically faster. The thing is that your computer still has to do the same amount of work; it's like trying to do multiple tasks at once yourself, which is not going to be faster either.
In a "normal" program, doing it sequentially is easier to read and write (that it is that much faster too surprises me a little). Multiprocessing is especially useful if you have to wait for something else, like a web request (you can send multiple at once and don't have to wait for each one), or if you have some sort of event loop.
My guess as to why the sequential version is faster is that Python already uses multiprocessing internally wherever it makes sense (don't quote me on that). Also, with threading it has to keep track of what is where, which means more overhead.
So, to go back to the real-world example: if you give a task to somebody else and, instead of waiting for it, you do other things at the same time as them, then you are faster.
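The "waiting for a web request" case is easy to simulate with time.sleep. This is only an illustrative sketch (no real network calls), showing that when the work is mostly waiting, a pool of workers finishes in roughly the time of one task instead of the sum of all of them:
import time
from multiprocessing import Pool

def fake_request(i):
    time.sleep(1)  # stand-in for waiting on a slow web request
    return i

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=8) as pool:
        pool.map(fake_request, range(8))
    # Roughly 1 second in total instead of 8, because all the waiting overlaps
    print(time.time() - start)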
I have a function that, without multiprocessing, loops over an array of 3-tuples and does some calculation. Since this array can be really long (>1 million entries), I thought using several processes could help speed things up.
I start with a list of points (random_points), from which I create all possible triples (combList). This combList is then passed to my function.
The basic code I have works, but only when the random_points list has 18 entries or less.
from scipy import stats
import numpy as np
import itertools
import multiprocessing as mp

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with point and append results to xy and r
        pass
    output.put((xy, r))

output = mp.Queue()
random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
combList = list(itertools.combinations(random_points, 3))
N = 6
processes = [mp.Process(target=calc3PointsList,
                        args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
             for i in range(1, N)]
for p in processes:
    p.start()
for p in processes:
    p.join()
results = [output.get() for p in processes]
As soon as the length of the random_points list is longer than 18 the program seems to go into a deadlock. With 18 and lower it just finishes fine. Am I using this whole multiprocessing module the wrong way?
OK, the problem is described in the programming guidelines mentioned by user2667217:
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
Removing the join operation made it work. Also, the right way to retrieve the results seems to be:
results = [output.get() for p in processes]
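In other words, drain the queue before joining. A minimal sketch of the safe ordering, assuming each worker puts exactly one item on the queue:
for p in processes:
    p.start()
# Pull every result off the queue first, so no child ever blocks on a full pipe
results = [output.get() for p in processes]
# Only join once the queue has been emptied
for p in processes:
    p.join()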
I don't see anything else you posted that is clearly wrong, but there is one thing you should definitely do: start new processes in an if __name__ == "__main__": block, see the programming guidelines.
from scipy import stats
import numpy as np
import itertools
import multiprocessing as mp

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with point and append results to xy and r
        pass
    output.put((xy, r))

if __name__ == "__main__":
    output = mp.Queue()
    random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
    combList = list(itertools.combinations(random_points, 3))
    N = 6
    processes = [mp.Process(target=calc3PointsList,
                            args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
                 for i in range(1, N)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    results = [output.get() for x in range(output.qsize())]
I have been trying to optimise my code using the multiprocessing module, but I think I have fallen into the trap of premature optimization.
For example, when running this code:
import multiprocessing as mp
from collections import Counter

num = 1000000
l = mp.Manager().list()
for i in range(num):
    l.append(i)
l_ = Counter(l)
It takes several times longer than this:
num = 1000000
l = []
for i in range(num):
    l.append(i)
l_ = Counter(l)
What is the reason the multiprocessing list is slower than regular lists? And are there ways to make them as efficient?
Shared data structures like Manager().list() are meant to be shared between processes, so every access has to be synchronized (locked) and goes through communication with the manager process. A plain list ([]), on the other hand, requires no locking at all.
With / without locking is what makes the difference.
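If the end goal is just the Counter, the usual way to make this efficient is to avoid the shared list entirely: let each worker count an ordinary local chunk and combine the partial results in the parent. A rough sketch along those lines (the chunk size of 100000 is arbitrary):
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # Each worker counts a plain local chunk; no shared state, no locking
    return Counter(chunk)

if __name__ == "__main__":
    num = 1000000
    chunks = [range(i, min(i + 100000, num)) for i in range(0, num, 100000)]
    with Pool() as pool:
        partial_counts = pool.map(count_chunk, chunks)
    l_ = Counter()
    for c in partial_counts:
        l_.update(c)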