I have a function that, without multiprocessing, loops over an array of 3-tuples and does some calculation. This array can be really long (>1 million entries), so I thought using several processes could help speed things up.
I start with a list of points (random_points) from which I create all possible 3-combinations (combList). This combList is then passed to my function.
The basic code I have works, but only when the random_points list has 18 entries or fewer.
from scipy import stats
import numpy as np
import itertools
import multiprocessing as mp

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with points and append results to xy and r
        pass
    output.put((xy, r))

output = mp.Queue()
random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
combList = list(itertools.combinations(random_points, 3))
N = 6
processes = [mp.Process(target=calc3PointsList,
                        args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
             for i in range(1, N)]
for p in processes:
    p.start()
for p in processes:
    p.join()
results = [output.get() for p in processes]
As soon as the random_points list is longer than 18 entries, the program seems to go into a deadlock. With 18 or fewer it finishes fine. Am I using this whole multiprocessing module the wrong way?
OK, the problem is described in the programming guidelines mentioned by user2667217:
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
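In other words, the queue has to be drained before the producer processes are joined. With the processes and output queue from your code, a minimal sketch of the ordering that avoids the deadlock, assuming each process puts exactly one item on the queue, looks like this:

for p in processes:
    p.start()

results = [output.get() for p in processes]   # drain the queue first ...

for p in processes:
    p.join()                                  # ... then joining is safe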
Removing the join operation (or retrieving the results before joining) made it work. The right way to retrieve the results seems to be:
results = [output.get() for p in processes]
I don't see anything else in what you posted that is clearly wrong, but there is one thing you should definitely do: start new processes inside an if __name__ == "__main__": block; see the programming guidelines.
from scipy import stats
import numpy as np
import itertools
import multiprocessing as mp

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with points and append results to xy and r
        pass
    output.put((xy, r))

if __name__ == "__main__":
    output = mp.Queue()
    random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
    combList = list(itertools.combinations(random_points, 3))
    N = 6
    processes = [mp.Process(target=calc3PointsList,
                            args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
                 for i in range(1, N)]
    for p in processes:
        p.start()
    # drain the queue before joining, otherwise the children block on the full pipe
    results = [output.get() for p in processes]
    for p in processes:
        p.join()
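For what it's worth, the same splitting can also be sketched with multiprocessing.Pool, which collects the results for you and avoids the manual Queue and slicing bookkeeping; the worker below is a hypothetical stand-in for the real per-triple calculation:

import itertools
import multiprocessing as mp
import random

def calc3Points(points):
    # hypothetical placeholder: return a result instead of putting it on a queue
    (x1, y1), (x2, y2), (x3, y3) = points
    return (x1 + x2 + x3, y1 + y2 + y3)

if __name__ == "__main__":
    random_points = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(30)]
    combList = list(itertools.combinations(random_points, 3))
    with mp.Pool(processes=5) as pool:
        results = pool.map(calc3Points, combList, chunksize=1000)
    print(len(results))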
A managers.dict() allows sharing a dictionary across processes and performing thread-safe operations on it.
In my case, a coordinator process creates the shared dict with m elements, and each of n worker processes reads from and writes to a single dict key.
Does a managers.dict() have one single lock for the whole dict, or m locks, one for every key in it?
Is there an alternative way to share m elements with n workers, other than a shared dict, when the workers do not have to communicate with each other?
Related: python-manager-dict-is-very-slow-compared-to-regular-dict
After some tests I can say there is only one lock per manager dict.
Here is the code that proves it:
import time
import multiprocessing as mp

def process_f(key, shared_dict):
    # build a large value so that writing it through the manager takes a while
    values = [i for i in range(64 * 1024 * 1024)]
    print("Writing {}...".format(key))
    a = time.time()
    shared_dict[key] = values
    b = time.time()
    print("released {} in {}ms".format(key, (b - a) * 1000))

def main():
    process_manager = mp.Manager()
    n = 5
    keys = [i for i in range(n)]
    shared_dict = process_manager.dict({i: i * i for i in keys})
    pool = mp.Pool(processes=n)
    for i in range(n):
        pool.apply_async(process_f, (keys[i], shared_dict))
    time.sleep(20)  # give the workers time to finish before the manager goes away

if __name__ == '__main__':
    main()
output:
Writing 4...
Writing 3...
Writing 1...
Writing 2...
Writing 0...
released 4 in 3542.7968502ms
released 0 in 4416.22900963ms
released 1 in 6247.48706818ms
released 2 in 7926.97191238ms
released 3 in 9973.71196747ms
Process finished with exit code 0
The increasing write times show the waiting that is happening: each worker has to wait for the single lock on the manager dict.
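As for the other question, when the workers do not need to communicate with each other, one alternative is to avoid the shared dict altogether: pass each worker its element as a task argument and collect the return values in the coordinator. A rough sketch of that pattern (not a claim about your actual workload):

import multiprocessing as mp

def work(item):
    key, value = item
    # purely local computation, nothing is shared while the worker runs
    return key, value * value

if __name__ == '__main__':
    elements = {i: i for i in range(5)}          # the m elements held by the coordinator
    with mp.Pool(processes=5) as pool:
        results = dict(pool.map(work, elements.items()))
    print(results)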
I am currently redesigning a program to use Python's multiprocessing pools. My first impression was that the execution time increased instead of decreasing. Therefore, I got curious and wrote a little test script:
import time
import multiprocessing

def simple(x):
    return 2*x

def less_simple(x):
    b = x
    for i in range(0, 100):
        b = b * i
    return 2*x

a = list(range(0, 1000000))

print("without multiprocessing:")
before = time.time()
res = map(simple, a)
after = time.time()
print(str(after - before))
print("-----")

print("with multiprocessing:")
for i in range(1, 5):
    before = time.time()
    with multiprocessing.Pool(processes=i) as pool:
        pool.map(simple, a)
    after = time.time()
    print(str(i) + " processes: " + str(after - before))
I get the following results:
without multiprocessing:
2.384185791015625e-06
with multiprocessing:
1 processes: 0.35068225860595703
2 processes: 0.21297240257263184
3 processes: 0.21887946128845215
4 processes: 0.3474385738372803
When I replace simple with less_simple in both the map call and the pool.map call, I get the following results:
without multiprocessing:
2.6226043701171875e-06
with multiprocessing:
1 processes: 3.1453816890716553
2 processes: 1.615351676940918
3 processes: 1.6125438213348389
4 processes: 1.5159809589385986
Honestly, I am a bit confused, because the non-multiprocessing version is always orders of magnitude faster. Additionally, increasing the number of processes seems to have little to no influence on the runtime. Therefore, I have a few questions:
Do I make some mistake in the usage of multiprocessing?
Are my test functions too simple to benefit from multiprocessing?
Is there a way to estimate at which point multiprocessing has an advantage, or do I have to test it?
I did some more research and basically, you are right. Both functions are rather small and somewhat artificial. However, there is a measurable time difference between the non-multiprocessing and the multiprocessing versions even for these functions, once you take into account how map works. In Python 3, map only returns a lazy iterator yielding the results [1], i.e., in the above example it only creates the iterator, which is of course very fast.
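So to time the actual work without multiprocessing, the iterator has to be consumed, for example by wrapping it in list() (a small sketch):

before = time.time()
res = list(map(simple, a))   # list() forces the lazy map iterator to actually run
after = time.time()
print(str(after - before))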
Therefore, I replaced the map function with a traditional for loop:
for elem in a:
    res = simple(elem)
For the simple function, execution is still faster without multiprocessing because the process start-up and communication overhead outweighs such a small function:
without multiprocessing:
0.1392803192138672
with multiprocessing:
1 processes: 0.38080787658691406
2 processes: 0.22507309913635254
3 processes: 0.21307945251464844
4 processes: 0.2152390480041504
However, in case of the function less_simple, you can see an actual advantage of multiprocessing:
without multiprocessing:
3.2029929161071777
with multiprocessing:
1 processes: 3.4934208393096924
2 processes: 1.8259460926055908
3 processes: 1.9196875095367432
4 processes: 1.716357946395874
[1] https://docs.python.org/3/library/functions.html#map
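A further knob that may help with such small per-item work is the chunksize argument of Pool.map, which batches many items into each inter-process task; a sketch, with the batch size picked arbitrarily:

with multiprocessing.Pool(processes=4) as pool:
    res = pool.map(less_simple, a, chunksize=10000)   # send work in large batches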
With multiprocessing.Pool, there are code samples in the tutorials where you set the number of processes from the CPU count. Can you set the number of CPUs with the multiprocessing.Process approach?
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print(num.value)
    print(arr[:])
Actually, Process represents only one process, which uses only one CPU (if you don't use threads); it is up to you to create as many Processes as you need.
This means that you have to create as many Processes as you have CPUs in order to use all of them (possibly one fewer if you are also doing work in the main process).
You can read the number of CPUs with multiprocessing.cpu_count().
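As a rough sketch (not from the original answer), spawning one worker per CPU with plain Process objects could look like this:

import multiprocessing as mp

def worker(worker_id):
    # placeholder for the real work of this process
    print("worker {} running".format(worker_id))

if __name__ == '__main__':
    n_cpus = mp.cpu_count()          # one Process per available CPU
    processes = [mp.Process(target=worker, args=(i,)) for i in range(n_cpus)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()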
I want to do something in parallel, but it always turns out slower. I have included two code snippets that can be compared. The multiprocessing version needs 12 seconds on my laptop, the sequential version only 3 seconds. I thought multiprocessing was supposed to be faster.
I know that the task itself does not make much sense; it is only there to compare the two approaches. I also know that bubble sort can be replaced by faster algorithms.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(iterator, alist, return_dictionary):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return_dictionary[iterator] = sample_list

if __name__ == '__main__':
    manager = Manager()
    return_dictionary = manager.dict()

    jobs = []
    for i in range(3000):
        p = Process(target=bubbleSort, args=(i, myArray, return_dictionary))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()

    print return_dictionary.values()
The other way:
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    results = []
    for i in range(3000):
        results.append(bubbleSort(myArray))
    print results
Multiprocessing is faster if you have multiple cores and do the parallelization properly. In your example you create 3000 processes, which causes an enormous amount of context switching between them. Instead, use a Pool to schedule the jobs across a fixed number of processes:
from multiprocessing import Pool

# getRandomSample and myArray are the same as in the question

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
        pass
I removed all the output and did some tests on my 4 core machine. As expected the code above was about 4 times faster than your sequential example.
Multiprocessing is not just magically faster. The thing is that your computer still has to do the same amount of work. It's like trying to do multiple tasks at once yourself: it does not get faster.
In a "normal" program, doing it sequentially is easier to read and write (that it is that much faster here surprises me a little). Multiprocessing is especially useful if you have to wait for another process, like a web request (you can send multiple at once and don't have to wait for each one), or when you have some sort of event loop.
My guess as to why the sequential version is faster is that Python already uses multiprocessing internally wherever it makes sense (don't quote me on that). Also, with threading it has to keep track of what is where, which means more overhead.
So, going back to the real-world example: if you give a task to somebody else and, instead of waiting for it, you do other things at the same time, then you are faster.
I wrote the following code, which calls the function compute_cluster 6 times in parallel (each run of this function is independent of the others, and each run writes its results to a separate file):
def main(argv):
    global L
    for L in range(6, 24):
        pool = Pool(6)
        pool.map(compute_cluster, range(1, 3))   # compute_cluster is defined elsewhere
        pool.close()

if __name__ == "__main__":
    main(sys.argv)
Despite the fact that I'm running this code on an i7 machine, and no matter how large I make the Pool, it only ever runs two processes in parallel. Is there any suggestion on how I can run 6 processes in parallel, such that the first three processes use L=6 and call compute_cluster with parameter values from 1:3 in parallel, while at the same time the other three processes run the same function with the same parameter values but with the global L value set to 7?
Any suggestions are highly appreciated.
There are a few things wrong here. First, as to why you only ever have 2 processes going at a time: range(1, 3) only returns 2 values, so you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case the code probably works, but it limits your performance, since it is what prevents you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like this [1]:
def wrapper(tup):
    l, r = tup
    # Even better would be to get rid of `L` and pass it to compute_cluster
    global L
    L = l
    compute_cluster(r)

for r in range(1, 3):
    p = Pool(6)
    p.map(wrapper, [(l, r) for l in range(6, 24)])
    p.close()
This works with the global L because each spawned process picks up its own copy of L; it doesn't get shared between processes.
[1] Untested code.
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
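Following the comment in wrapper, if compute_cluster can be changed to accept L as a regular first argument (a hypothetical signature, not from the question), the global disappears and starmap covers every (L, r) pair directly:

p = Pool(6)
# hypothetical: compute_cluster(l, r) takes the former global L as a parameter
p.starmap(compute_cluster, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()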