How python manager.dict() locking works

A manager.dict() allows you to share a dictionary across processes and perform thread-safe operations on it.
In my case a coordinator process creates the shared dict with m elements, and n worker processes each read from and write to a single dict key.
Does manager.dict() have one single lock for the whole dict, or m locks, one for every key in it?
Is there an alternative way to share m elements with n workers, other than a shared dict, when the workers do not have to communicate with each other?
Related: python-manager-dict-is-very-slow-compared-to-regular-dict

After some tries I can say there is only one lock per manager dict.
Here is the code that proves it:
import time
import multiprocessing as mp

def process_f(key, shared_dict):
    values = [i for i in range(64 * 1024 * 1024)]
    print "Writing {}...".format(key)
    a = time.time()
    shared_dict[key] = values
    b = time.time()
    print "released {} in {}ms".format(key, (b-a)*1000)

def main():
    process_manager = mp.Manager()
    n = 5
    keys = [i for i in range(n)]
    shared_dict = process_manager.dict({i: i * i for i in keys})
    pool = mp.Pool(processes=n)
    for i in range(n):
        pool.apply_async(process_f, (keys[i], shared_dict))
    time.sleep(20)

if __name__ == '__main__':
    main()
output:
Writing 4...
Writing 3...
Writing 1...
Writing 2...
Writing 0...
released 4 in 3542.7968502ms
released 0 in 4416.22900963ms
released 1 in 6247.48706818ms
released 2 in 7926.97191238ms
released 3 in 9973.71196747ms
Process finished with exit code 0
The increasing write times show the waiting that is happening: the writes serialize on the dict's single lock.
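Regarding the alternative asked about in the question: when the workers never need to see each other's results, a shared dict is not required at all; each element can be passed to its worker as an argument and the results collected through the pool. A minimal sketch of that idea (the work function and the element values here are placeholders, not taken from the code above):
import multiprocessing as mp

def work(item):
    # placeholder worker: each process only touches its own (key, value) pair
    key, value = item
    return key, value * value

def main():
    elements = {i: i for i in range(5)}        # the m elements
    with mp.Pool(processes=5) as pool:         # the n workers
        # every worker gets only the element it needs, so no manager,
        # no shared dict and no single dict lock are involved
        results = dict(pool.map(work, elements.items()))
    print(results)

if __name__ == '__main__':
    main()
Whether this beats the manager depends on how large the values are, since the results still have to be pickled back to the parent.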


Python: dynamic control of the number of processes in a multiprocessing script according to the amount of free RAM and a parameter of the function

I've got an unusual Python question. I'm using the multiprocessing library to map a function f((dynamic1, dynamic2), fix1, fix2).
import multiprocessing as mp

fix1 = 4
fix2 = 6
# Number of cores to use
N = 6
dynamic_duos = [(a, b) for a in range(5) for b in range(10)]
with mp.Pool(processes=N) as p:
    p.starmap(f, [(dyn, fix1, fix2) for dyn in dynamic_duos])
I would like to control the number of active processes dynamically because the function sometimes consumes a LOT of RAM. The idea would be to check at every iteration (i.e. before any call of the function f) whether sum(dyn) is below a threshold and whether the amount of free RAM is above a threshold. If both conditions are met, a new process can start and compute the function.
An additional condition would be the maximum number of processes: the number of cores on the PC.
Thanks for the help :)
Edit: Details on the reasons.
Some of the combinations of parameters have a high RAM consumption (up to 80 GB in one process). I know more or less which ones will use a lot of RAM, and when the program encounters them, I would like to wait for the other processes to end, run this high-RAM combination in a single process, and then resume the computation with more processes on the rest of the combinations to map.
Edit: my attempt based on the answer below.
It doesn't work, but it doesn't raise an error either; the program just runs to completion.
# Imports
import itertools
import concurrent.futures
import os
import time

# Parameters
N = int(input("Number of CPUs to use: "))
t0 = 0
tf = 200
s_step = 0.05
max_s = None
folder = "test"

possible_dynamics = [My_class(x) for x in [20, 30, 40, 50, 60]]
dynamics_to_compute = [list(x) for x in itertools.combinations_with_replacement(possible_dynamics, 2)] \
                    + [list(x) for x in itertools.combinations_with_replacement(possible_dynamics, 3)]
function_inputs = [(dyn, t0, tf, s_step, max_s, folder) for dyn in dynamics_to_compute]

# -----------
# Computation
# -----------
start = time.time()

# Pool creation and computation
futures = []
pool = concurrent.futures.ProcessPoolExecutor(max_workers=N)
for Obj, t0, tf, s_step, max_s, folder in function_inputs:
    if large_memory(Obj, s_step, max_s):
        concurrent.futures.wait(futures)  # wait for all pending tasks
        large_future = pool.submit(compute, Obj, t0, tf,
                                   s_step, max_s, folder)
        large_future.result()  # wait for large computation to finish
    else:
        future = pool.submit(compute, Obj, t0, tf,
                             s_step, max_s, folder)
        futures.append(future)

end = time.time()
if round(end - start, 3) < 60:
    print("Complete - Elapsed time: {} s".format(round(end - start, 3)))
else:
    print("Complete - Elapsed time: {} mn and {} s".format(int((end - start) // 60), round((end - start) % 60, 3)))
os.system("pause")
This is still a simplified example of my code, but the idea is there. It runs in less than 0.2 s, which means it never actually called the function compute.
N.B: Obj is not the actual variable name.
To achieve this you need to give up on the use of map to gain more control over the execution flow of your tasks.
This code implements the algorithm you described at the end of your question. I'd recommend using the concurrent.futures library, as it exposes a neater set of APIs.
import concurrent.futures

pool = concurrent.futures.ProcessPoolExecutor(max_workers=6)
futures = []

for dyn in dynamic_duos:
    if large_memory(dyn, fix1, fix2):
        concurrent.futures.wait(futures)  # wait for all pending tasks
        large_future = pool.submit(f, dyn, fix1, fix2)
        large_future.result()  # wait for large computation to finish
    else:
        future = pool.submit(f, dyn, fix1, fix2)
        futures.append(future)
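For completeness, here is a minimal, runnable sketch of the same pattern with placeholder f and large_memory functions (assumptions, not the original code), wrapped in a __main__ guard and with a final wait so every submitted task finishes before the program ends:
import concurrent.futures

def f(dyn, fix1, fix2):
    # placeholder for the real computation
    return sum(dyn) + fix1 + fix2

def large_memory(dyn, fix1, fix2):
    # placeholder predicate marking the combinations known to need a lot of RAM
    return sum(dyn) > 10

def main():
    fix1, fix2 = 4, 6
    dynamic_duos = [(a, b) for a in range(5) for b in range(10)]

    futures = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as pool:
        for dyn in dynamic_duos:
            if large_memory(dyn, fix1, fix2):
                concurrent.futures.wait(futures)          # drain the pending small tasks
                pool.submit(f, dyn, fix1, fix2).result()  # run the big one on its own
            else:
                futures.append(pool.submit(f, dyn, fix1, fix2))
        concurrent.futures.wait(futures)                  # wait for the remaining tasks

if __name__ == '__main__':
    main()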

Is IPC slowing this down?

I understand that there is overhead when using the Multiprocessing module, but this seems to be a high amount and the level of IPC should be fairly low from what I can gather.
Say I generate a large-ish list of random numbers between 1-1000 and want to obtain a list of only the prime numbers. This code is only meant to test multiprocessing on CPU-intensive tasks. Ignore the overall inefficiency of the primality test.
The bulk of the code may look something like this:
from random import SystemRandom
from math import sqrt
from timeit import default_timer as time
from multiprocessing import Pool, Process, Manager, cpu_count

rdev = SystemRandom()
NUM_CNT = 0x5000

nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
primes = []

def chunk(l, n):
    i = int(len(l)/float(n))
    for j in range(0, n-1):
        yield l[j*i:j*i+i]
    yield l[n*i-i:]

def is_prime(n):
    if n <= 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
It seems to me that I should be able to split this up among multiple processes. I have 8 logical cores, so I should be able to use cpu_count() as the # of processes.
Serial:
def serial():
    global primes
    primes = []
    for num in nums:
        if is_prime(num):
            primes.append(num)  # primes now contain all the values
The following sizes of NUM_CNT correspond to the speed:
0x500 = 0.00100 sec.
0x5000 = 0.01723 sec.
0x50000 = 0.27573 sec.
0x500000 = 4.31746 sec.
This was the way I chose to do the multiprocessing. It uses the chunk() function to split nums into cpu_count() (roughly equal) parts. It passes each chunk into a new process, which iterates through it and then assigns the result to an entry of a shared dict variable. The IPC should really only occur when I assign the value to the shared variable. Why would it occur otherwise?
def loop(ret, id, numbers):
    l_primes = []
    for num in numbers:
        if is_prime(num):
            l_primes.append(num)
    ret[id] = l_primes

def parallel():
    man = Manager()
    ret = man.dict()
    num_procs = cpu_count()
    procs = []
    for i, l in enumerate(chunk(nums, num_procs)):
        p = Process(target=loop, args=(ret, i, l))
        p.daemon = True
        p.start()
        procs.append(p)
    [proc.join() for proc in procs]
    return sum(ret.values(), [])
Again, I expect some overhead, but the time seems to be growing much faster than in the serial version.
0x500 = 0.37199 sec.
0x5000 = 0.91906 sec.
0x50000 = 8.38845 sec.
0x500000 = 119.37617 sec.
What is causing it to do this? Is it IPC? The initial setup makes me expect some overhead, but this is just an insane amount.
Edit:
Here's how I'm timing the execution of the functions:
if __name__ == '__main__':
    print(hex(NUM_CNT))
    for func in (serial, parallel):
        t1 = time()
        vals = func()
        t2 = time()
        if vals is None:  # serial has no return value
            print(len(primes))
        else:             # but parallel does
            print(len(vals))
        print("Took {:.05f} sec.".format(t2 - t1))
The same list of numbers is used each time.
Example output:
0x5000
3442
Took 0.01828 sec.
3442
Took 0.93016 sec.
Hmm. How do you measure time? On my computer, the parallel version is much faster than the serial one.
I'm measuring using time.time() this way (assume tt is an alias for time.time()):
t1 = int(round(tt() * 1000))
serial()
t2 = int(round(tt() * 1000))
print(t2 - t1)
parallel()
t3 = int(round(tt() * 1000))
print(t3 - t2)
I get, with 0x500000 as input:
5519ms for the serial version
3351ms for the parallel version
I believe that your mistake is caused by including the number generation inside the parallel timing, but not inside the serial one.
On my computer, generating the random numbers takes something like 45 seconds (it's a very slow process), so that can explain the difference between your two measurements, as I don't think my computer uses a very different architecture.
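One way to act on that guess is to make sure the numbers are generated exactly once, in the parent, and to time only the two function calls. A small sketch under that assumption; it reuses rdev, NUM_CNT, serial, parallel and the timeit alias from the question above (so it is not self-contained), and assumes the module-level nums = ... line is moved under the guard so spawned workers do not re-run it at import time:
if __name__ == '__main__':
    # build the input once, in the parent, before any timing starts
    nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]

    t1 = time()
    serial()
    t2 = time()
    parallel()
    t3 = time()

    print("serial:   {:.5f} s".format(t2 - t1))
    print("parallel: {:.5f} s".format(t3 - t2))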

Multiprocessing in Python not faster than doing it sequentially

I want to do something in parallel, but it always ends up slower. I put an example of two code snippets which can be compared. The multiprocessing way needs 12 seconds on my laptop; the sequential way only 3 seconds. I thought multiprocessing was faster.
I know that the task done this way does not make any sense, but it is just made to compare the two approaches. I also know bubble sort can be replaced by faster sorting algorithms.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1,1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(iterator, alist, return_dictionary):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1,0,-1):
        for i in range(passnum):
            if sample_list[i]>alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return_dictionary[iterator] = sample_list

if __name__ == '__main__':
    manager = Manager()
    return_dictionary = manager.dict()
    jobs = []
    for i in range(3000):
        p = Process(target=bubbleSort, args=(i, myArray, return_dictionary))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print return_dictionary.values()
The other way:
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1,1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return([myset[i] for i in sorted_list])

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1,0,-1):
        for i in range(passnum):
            if sample_list[i]>alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    results = []
    for i in range(3000):
        results.append(bubbleSort(myArray))
    print results
Multiprocessing is faster if you have multiple cores and parallelize properly. In your example you create 3000 processes, which causes an enormous amount of context switching between them. Instead, use a Pool to schedule the jobs onto a fixed set of worker processes:
from multiprocessing import Pool

def bubbleSort(alist):
    sample_list = (getRandomSample(alist, 100))
    for passnum in range(len(sample_list)-1,0,-1):
        for i in range(passnum):
            if sample_list[i]>alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return(sample_list)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
        pass
I removed all the output and did some tests on my 4 core machine. As expected the code above was about 4 times faster than your sequential example.
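If the sorted samples are actually needed rather than thrown away, they can be collected directly from imap_unordered. A short sketch, assuming the same bubbleSort, getRandomSample and myArray definitions as above:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)
    # imap_unordered yields each sorted sample as soon as some worker finishes it
    results = list(pool.imap_unordered(bubbleSort, (myArray for _ in range(3000))))
    pool.close()
    pool.join()
    print(len(results))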
Multiprocessing is not just magically faster. Your computer still has to do the same amount of work; it's like trying to do several tasks at once yourself, it does not get faster.
In a "normal" program, doing it sequentially is easier to read and write (that it is that much faster too surprises me a little). Multiprocessing is especially useful if you have to wait for another process, like a web request (you can send several at once and don't have to wait for each one), or if you have some sort of event loop.
My guess as to why the sequential version is faster is that Python already uses multiprocessing internally wherever it makes sense (don't quote me on that). Also, with threading it has to keep track of what is where, which means more overhead.
So, to go back to the real-world example: if you give a task to somebody else and, instead of waiting for it, you do other things at the same time as them, then you are faster.

Deadlock with multiprocessing module

I have a function that, without multiprocessing, loops over an array of 3-tuples and does some calculation. This array can be really long (>1 million entries), so I thought using several processes could help speed things up.
I start with a list of points (random_points) from which I create the list of all possible triples (combList). This combList is then passed to my function.
The basic code I have works but only when the random_points list has 18 entries or less.
from scipy import stats
import itertools
import multiprocessing as mp
import numpy as np

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with points and append results to xy and r
        pass
    output.put((xy, r))

output = mp.Queue()
random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
combList = list(itertools.combinations(random_points, 3))
N = 6
processes = [mp.Process(target=calc3PointsList,
                        args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
             for i in range(1, N)]
for p in processes:
    p.start()
for p in processes:
    p.join()
results = [output.get() for p in processes]
As soon as the random_points list has more than 18 entries, the program seems to go into a deadlock. With 18 or fewer it finishes fine. Am I using this whole multiprocessing module the wrong way?
OK, the problem is described in the programming guidelines mentioned by user2667217:
Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the Queue.cancel_join_thread method of the queue to avoid this behaviour.)
This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.
Removing the join operation made it work. Also, the right way to retrieve the results seems to be:
results = [output.get() for p in processes]
I don't see anything else you posted that is clearly wrong, but there is one thing you should definitely do: start new processes in an if __name__ == "__main__": block, see the programming guidelines.
from scipy import stats
import itertools
import multiprocessing as mp
import numpy as np

def calc3PointsList(points, output):
    xy = []
    r = []
    for point in points:
        # do stuff with points and append results to xy and r
        pass
    output.put((xy, r))

if __name__ == "__main__":
    output = mp.Queue()
    random_points = [np.array((stats.uniform(-0.5, 1).rvs(), stats.uniform(-0.5, 1).rvs())) for _ in range(18)]
    combList = list(itertools.combinations(random_points, 3))
    N = 6
    processes = [mp.Process(target=calc3PointsList,
                            args=(combList[(i-1)*len(combList)//(N-1):i*len(combList)//(N-1)], output))
                 for i in range(1, N)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    results = [output.get() for x in range(output.qsize())]
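Putting the two fixes together (drain the queue before joining, and keep the __main__ guard), the driver could look like the sketch below; calc3PointsList here is a placeholder, and plain random floats stand in for the scipy points, since only the structure matters:
import itertools
import multiprocessing as mp
import random

def calc3PointsList(points, output):
    # placeholder for the real computation over the 3-tuples
    output.put(len(points))

if __name__ == "__main__":
    random_points = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5))
                     for _ in range(50)]
    combList = list(itertools.combinations(random_points, 3))
    N = 6
    output = mp.Queue()
    processes = [mp.Process(target=calc3PointsList,
                            args=(combList[(i - 1) * len(combList) // (N - 1):
                                           i * len(combList) // (N - 1)], output))
                 for i in range(1, N)]
    for p in processes:
        p.start()
    # drain the queue BEFORE joining, so each child's feeder thread can flush its buffer
    results = [output.get() for _ in processes]
    for p in processes:
        p.join()
    print(results)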

multiprocessing full capacity in Python

I wrote the following code, which calls the function compute_cluster 6 times in parallel (each run of this function is independent of the others, and each run writes its results to a separate file):
import sys
from multiprocessing import Pool

def main(argv):
    global L
    for L in range(6, 24):
        pool = Pool(6)
        pool.map(compute_cluster, range(1, 3))
        pool.close()

if __name__ == "__main__":
    main(sys.argv)
Despite the fact that I'm running this code on an i7 machine, and no matter how large I make the Pool, it only ever runs two processes in parallel. Is there any suggestion on how I can run 6 processes in parallel, such that the first three processes use L=6 and call compute_cluster with the parameter values from range(1, 3) in parallel, while at the same time the other three processes run the same function with the same parameter values but with the global L value set to 7?
Any suggestion is highly appreciated.
There are a few things wrong here. First, as to why you only ever have 2 processes going at a time: range(1, 3) only returns 2 values, so you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case the code probably works, but it limits your performance, since the global is what prevents you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like this1:
def wrapper(tup):
    l, r = tup
    # Even better would be to get rid of `L` and pass it to compute_cluster
    global L
    L = l
    compute_cluster(r)

for r in range(1, 3):
    p = Pool(6)
    p.map(wrapper, [(l, r) for l in range(6, 24)])
    p.close()
This works with the global L because each spawned process picks up its own copy of L -- It doesn't get shared between processes.
1 Untested code
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
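Following the comment inside wrapper, a cleaner variant drops the global entirely and passes L as an explicit argument. This is only a sketch; it assumes compute_cluster can be changed to accept L (the body below is a placeholder):
from multiprocessing import Pool

def compute_cluster(r, l):
    # placeholder: the real function would take L explicitly and write
    # the results for this (l, r) combination to its own file
    return l, r

def run_one(tup):
    l, r = tup
    return compute_cluster(r, l)

if __name__ == "__main__":
    with Pool(6) as p:
        p.map(run_one, [(l, r) for l in range(6, 24) for r in range(1, 3)])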
