I understand that there is overhead when using the multiprocessing module, but this seems like an unusually high amount, and the level of IPC should be fairly low from what I can gather.
Say I generate a large-ish list of random numbers between 0 and 1000 and want to obtain a list of only the prime numbers. This code is only meant to test multiprocessing on CPU-intensive tasks; ignore the overall inefficiency of the primality test.
The bulk of the code may look something like this:
from random import SystemRandom
from math import sqrt
from timeit import default_timer as time
from multiprocessing import Pool, Process, Manager, cpu_count
rdev = SystemRandom()
NUM_CNT = 0x5000
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
primes = []
def chunk(l, n):
    i = int(len(l)/float(n))
    for j in range(0, n-1):
        yield l[j*i:j*i+i]
    yield l[n*i-i:]

def is_prime(n):
    if n <= 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
It seems to me that I should be able to split this up among multiple processes. I have 8 logical cores, so I should be able to use cpu_count() as the # of processes.
Serial:
def serial():
    global primes
    primes = []
    for num in nums:
        if is_prime(num):
            primes.append(num)  # primes now contain all the values
The following values of NUM_CNT give these timings:
0x500 = 0.00100 sec.
0x5000 = 0.01723 sec.
0x50000 = 0.27573 sec.
0x500000 = 4.31746 sec.
This is the way I chose to do the multiprocessing. It uses the chunk() function to split nums into cpu_count() roughly equal parts. Each chunk is passed to a new process, which iterates through it and then assigns its result to an entry of a shared dict variable. The only IPC should really occur when I assign the value to the shared variable; why would it occur anywhere else?
def loop(ret, id, numbers):
    l_primes = []
    for num in numbers:
        if is_prime(num):
            l_primes.append(num)
    ret[id] = l_primes

def parallel():
    man = Manager()
    ret = man.dict()
    num_procs = cpu_count()
    procs = []
    for i, l in enumerate(chunk(nums, num_procs)):
        p = Process(target=loop, args=(ret, i, l))
        p.daemon = True
        p.start()
        procs.append(p)
    [proc.join() for proc in procs]
    return sum(ret.values(), [])
Again, I expect some overhead, but the time seems to be growing far faster than it does for the serial version.
0x500 = 0.37199 sec.
0x5000 = 0.91906 sec.
0x50000 = 8.38845 sec.
0x500000 = 119.37617 sec.
What is causing it to do this? Is it IPC? The initial setup makes me expect some overhead, but this is just an insane amount.
Edit:
Here's how I'm timing the execution of the functions:
if __name__ == '__main__':
    print(hex(NUM_CNT))
    for func in (serial, parallel):
        t1 = time()
        vals = func()
        t2 = time()
        if vals is None:  # serial has no return value
            print(len(primes))
        else:  # but parallel does
            print(len(vals))
        print("Took {:.05f} sec.".format(t2 - t1))
The same list of numbers is used each time.
Example output:
0x5000
3442
Took 0.01828 sec.
3442
Took 0.93016 sec.
Hmm. How do you measure time? On my computer, the parallel version is much faster than the serial one.
I'm measuring using time.time() this way (assume tt is an alias for time.time()):
t1 = int(round(tt() * 1000))
serial()
t2 = int(round(tt() * 1000))
print(t2 - t1)
parallel()
t3 = int(round(tt() * 1000))
print(t3 - t2)
I get, with 0x500000 as input:
5519ms for the serial version
3351ms for the parallel version
I believe that the difference is caused by including the number generation inside the parallel measurement, but not inside the serial one.
On my computer, generating the random numbers takes around 45 seconds (it's a very slow process), so it could explain the difference between your two values, as I don't think my computer uses a very different architecture.
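One way that can happen is with a spawn-based start method (the default on Windows and newer macOS): each child process re-imports the main module, so the module-level nums = [...] line runs again in every worker, inside the parallel timing. Below is a minimal sketch of the fix being suggested, building the list once inside the __main__ guard and outside the timed region; it is an illustration under that assumption, not the asker's exact code.
from math import sqrt
from random import SystemRandom
from timeit import default_timer as timer

def is_prime(n):                      # same test as in the question
    if n <= 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    rdev = SystemRandom()
    nums = [rdev.randint(0, 1000) for _ in range(0x500000)]  # built once, not timed

    t1 = timer()
    primes = [n for n in nums if is_prime(n)]
    t2 = timer()
    print("serial took {:.05f} sec. ({} primes)".format(t2 - t1, len(primes)))
    # parallel() from the question can be timed the same way: the chunks are
    # passed to each Process as arguments, so the workers never need a
    # module-level nums.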
Related
I have been trying to widen my understanding of Python recently, so I decided to make a prime number calculator to work on my optimization skills. I have worked all day on this and have improved the time to go through 0-100,000 from roughly 25-30 seconds down to .15 seconds. However, I am dealing with rapidly growing runtimes as I go higher in the iterations.
My question now is: how would I implement multi-threading? I have tried following tutorials and tried to structure my code in a modular fashion, with functions, but I have been banging my head against this problem for over four hours with no progress. Any help would be much appreciated.
It may be a bit chaotic, but the idea here is that the main loop calls the range_primes function, which loops over a specified range of numbers, checks whether each is prime, and returns a list of the values found to be prime in that range. My thought process was that I could break off "chunks" of the number line and feed them to different processes to efficiently manage resources.
One of the main problems I ran into was that I could not figure out how to append all of the lists returned by the processes to a master output list. Just thought of this now: what about writing to a file? I/O operations are slow, but it might be easier to write each result to a new line of a txt file than anything else.
Maybe a class of some sort could hold a process and its outputs, which could then be queried for the values?
My current (not working) code:
from time import perf_counter
import math
import multiprocessing
import subprocess

#process a chunk of numbers given start and length
def chunk_process(index: int = 0, chunk: int = 13) -> list:
    tmplst = []
    tmplst = range_primes(((index)*chunk)+1, chunk*(index+1))
    return tmplst

#check if a given number is prime
def isPrime_brute(num: int) -> bool:
    if num > 1:
        for i in range(2, int(math.sqrt(num))+1):
            if (num % i) == 0:
                return False
        else:
            return True  #only returns true if nothing found
    else:
        return False

#get which numbers are primes in a given range
def range_primes(min_num: int = 0, max_num: int = 100) -> list:
    out_list = []
    mid = 0
    #print(f'starting a test to list primes between {max_num} and {min_num}')
    #print('Notifications will be given at the halfway point')
    for i in range(min_num, max_num):
        if isPrime_brute(i):
            out_list.append(i)
    return out_list
    #print(out_list)

if __name__ == "__main__":
    #setup multiprocessing
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    cores = multiprocessing.cpu_count()
    maxsent = 0
    #setup other stuff
    time_run = 30 #S
    startPoint = 0
    loop_num = 0
    primes = []
    tmplst = []
    chunk_size = 7
    max_assigned = 0
    print(f"starting calculation for {time_run} seconds")
    start_time = perf_counter()
    #majic(intentional)
    while (perf_counter()-start_time) <= time_run:
        r = pool.map_async(chunk_process, [i for i in range(maxsent, maxsent+cores+1)])
        maxsent = maxsent+cores+1
    #outputs
    elapsed = (perf_counter()-start_time)  #seconds taken
    print(r)
    print(f"{elapsed} S were used to find {len(r)} prime numbers")
    print(f"{startPoint+chunk_size} numbers were tried")
    print(f"with {chunk_size} chunking")
I know that it is not very cleanly written, but I just pieced it together in an evening and am not very experienced.
This version will run for a certain amount of time and quit operations shortly after a threshold is reached. As far as the multiprocessing goes, it is a patchwork of stuff that I have tried from different tutorials/documentation. I don't really have any idea how to proceed. Any input would be much appreciated.
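For reference, here is one way the per-chunk results could be gathered into a single master list, which is the part the question above asks about. It is only a sketch of the idea, not a fix for the exact code: Pool.map blocks until every chunk is finished and returns the per-chunk lists in submission order, so they can simply be flattened afterwards. The helper functions, chunk size, and limit below are simplified stand-ins, not taken from the original post.
import math
from multiprocessing import Pool, cpu_count

def is_prime(num: int) -> bool:
    # simplified stand-in for isPrime_brute above
    if num < 2:
        return False
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True

def primes_in_range(bounds) -> list:
    # simplified stand-in for range_primes / chunk_process above
    start, stop = bounds
    return [n for n in range(start, stop) if is_prime(n)]

if __name__ == "__main__":
    chunk = 10_000                       # arbitrary chunk size
    limit = 100_000
    ranges = [(i, i + chunk) for i in range(0, limit, chunk)]
    with Pool(cpu_count()) as pool:
        per_chunk = pool.map(primes_in_range, ranges)  # one list per chunk, in order
    primes = [p for sub in per_chunk for p in sub]     # flattened master list
    print(f"found {len(primes)} primes below {limit}")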
I am trying to streamline a program that involves a set of short tasks that can be done in parallel, where the results of the set of tasks must be compared before moving on to the next step (which again involves a set of short tasks, then another set, etc.). Because these tasks are so short, the set-up time makes multiprocessing not worthwhile. I am wondering if there is another way to do these short tasks in parallel that is faster than running them linearly. The only question I can find on this site that describes this problem for Python references this answer on memory sharing, which I don't think answers my question (or if it does, I could not follow how).
To illustrate what I am hoping to do, consider the problem of summing a bunch of numbers from 0 to N. (Of course this can be solved analytically; my point is to come up with a low-memory but short CPU-intensive task.) First, the linear approach would simply be:
def numbers(a,b):
    return(i for i in range(a,b))

def linear_sum(a):
    return(sum(numbers(a[0],a[1])))

n = 2000
linear_sum([0, n+1])
# 2001000
For threading, I want to break the problem into parts that can then be summed separately and then combined, so the idea would be to get a bunch of ranges over which to sum with something like
import numpy as np

def get_ranges(i, Nprocess = 3):
    di = i // Nprocess
    j = np.append(np.arange(0, i, di), [i+1,])
    return([(j[k], j[k+1]) for k in range(len(j)-1)])
and for some value n >> Nprocess, the pseudocode would be something like
values = get_ranges(n)
x = []
for value in values:
    x.append(do_someting_parallel(value))
return(sum(x))
The question, then, is how to implement do_someting_parallel. For multiprocessing, we can do something like:
from multiprocessing import Pool as ThreadPool

def mpc_thread_sum(i, Nprocess = 3):
    values = get_ranges(i)
    pool = ThreadPool(Nprocess)
    results = pool.map(linear_sum, values)
    pool.close()
    pool.join()
    return(sum(results))

print(mpc_thread_sum(2000))
# 2001000
The graph below shows the performance of the different approaches described. Is there a way to speed up computations in the region where multiprocessing is still slower than linear, or is this the limit of parallelization under Python's GIL? I suspect the answer may be that I am hitting this limit, but I wanted to ask here to be sure. I tried multiprocessing.dummy, asyncio, threading, and ThreadPoolExecutor (from concurrent.futures). For brevity, I have omitted the code, but all show execution times comparable to the linear approach. All are designed for I/O-bound tasks, so they are constrained by the GIL.
My first observation is that the running time of function numbers can be cut roughly in half by simply defining it as:
def numbers(a, b):
    return range(a, b)
Second, a task that is 100% CPU-bound, such as computing the sum of numbers, can never perform significantly better in pure Python without the aid of a C-language runtime library (such as numpy), because of contention for the Global Interpreter Lock (GIL), which prevents any thread-based parallelization from occurring (and asyncio only uses a single thread to begin with).
Third, the only way you can achieve a performance improvement running pure Python code against a 100% CPU-bound task is with multiprocessing. But there is CPU overhead in creating the process pool, overhead in passing arguments from the main process to the address space in which the process pool's processes run, and overhead again in passing back the results. So for there to be any performance improvement, the worker function, linear_sum, cannot be trivial; it must require enough CPU processing to warrant the additional overhead I just mentioned.
The following benchmark runs the worker function, renamed compute_sum, which now accepts a range as its argument. To further reduce overhead, I have introduced a function split that takes the passed range argument and generates multiple range instances, removing the need to use numpy and to generate arrays. The benchmark computes the sum using a single thread (linear), a multithreading pool, and a multiprocessing pool, and is run for both n = 2000 and n = 50_000_000. The benchmark displays the elapsed time and the total CPU time across all processes.
For n = 2000, multiprocessing, as expected, performs worse than both linear and multithreading. For n = 50_000_000, multiprocessing's total CPU time is a bit higher than for linear and multithreading as is expected due to the additional aforementioned overhead. But now the elapsed time has gone down significantly. For both values of n, multithreading is a loser.
from multiprocessing.pool import Pool, ThreadPool
import time

def split(iterable, n):
    k, m = divmod(len(iterable), n)
    return (iterable[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def compute_sum(r):
    t = time.process_time()
    return (sum(r), time.process_time() - t)

if __name__ == '__main__':
    for n in (2000, 50_000_000):
        r = range(0, n+1)

        t1 = time.time()
        s, cpu = compute_sum(r)
        elapsed = time.time() - t1
        print(f'n = {n}, linear elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')

        t1 = time.time()
        t2 = time.process_time()
        thread_pool = ThreadPool(4)
        s = 0
        for return_value, process_time in thread_pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
        elapsed = time.time() - t1
        cpu = time.process_time() - t2
        print(f'n = {n}, thread pool elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        thread_pool.close()
        thread_pool.join()

        t1 = time.time()
        t2 = time.process_time()
        pool = Pool(4)
        s = 0
        cpu = 0
        for return_value, process_time in pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
            cpu += process_time
        elapsed = time.time() - t1
        cpu += time.process_time() - t2
        print(f'n = {n}, multiprocessing elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        pool.close()
        pool.join()
        print()
Prints:
n = 2000, linear elapsed time = 0.0, total cpu time = 0.0, sum = 2001000
n = 2000, thread pool elapsed time = 0.00700068473815918, total cpu time = 0.015625, sum = 2001000
n = 2000, multiprocessing elapsed time = 0.13200139999389648, total cpu time = 0.015625, sum = 2001000
n = 50000000, linear elapsed time = 2.0311124324798584, total cpu time = 2.03125, sum = 1250000025000000
n = 50000000, thread pool elapsed time = 2.050999164581299, total cpu time = 2.046875, sum = 1250000025000000
n = 50000000, multiprocessing elapsed time = 0.7579991817474365, total cpu time = 2.359375, sum = 1250000025000000
I am trying to measure the advantage of the Pool class in the multiprocessing module over normal sequential programming, and I am calculating the square of a number using a function. When I measure the time taken to square all the numbers with the pool it takes around ~0.24 sec, but when I calculate it normally in a for loop it takes even less, ~0.007 sec. Why is that? Shouldn't the version with the pool be faster?
import time
from multiprocessing import Pool, Process

def f(x):
    return x*x

if __name__ == '__main__':
    start = time.time()
    array = []
    for i in range(1000000):
        array.append(i)
    with Pool(4) as p:
        (p.map(f, array))
    print(time.time()-start)  # time taken when using pool

    start1 = time.time()
    for i in range(1000000):
        f(array[i])
    print(time.time()-start1)  # time taken normally
So, as suggested by Klaus D. and wwii, I did not have enough computation to overcome the overhead of spawning the processes and the time spent switching between them.
Below is the updated code, which shows the difference. Hope it helps.
import multiprocessing
import time
import random
from multiprocessing import Pool, Process

def f(x):
    time.sleep(3)

if __name__ == '__main__':
    array = []
    for i in range(4):
        array.append(i)

    start = time.time()
    with Pool(4) as p:
        (p.map(f, array))
    print(time.time()-start)  # time taken when using pool

    start1 = time.time()
    for i in range(4):
        f(array[i])
    print(time.time()-start1)  # time taken normally
The problem is that the function you hand to the pool's workers is too simple to benefit from parallelism. Try this:
import time
from multiprocessing import Pool, Process

N = 80
M = 1_000_000

def f_std(array):
    """
    Calculate standard deviation
    """
    mean = sum(array)/len(array)
    std = ((sum(map(lambda x: (x-mean)**2, array)))/len(array))**.5
    return std

if __name__ == '__main__':
    array = []
    for i in range(N):
        array.append(range(M))

    start = time.time()
    with Pool(8) as p:
        (p.map(f_std, array))
    print(time.time()-start)  # time taken when using pool

    start1 = time.time()
    for i in range(N):
        f_std(array[i])
    print(time.time()-start1)  # time taken normally
I've got an unusual Python question. I'm using the multiprocessing library to map a function f((dynamic1, dynamic2), fix1, fix2).
import multiprocessing as mp

fix1 = 4
fix2 = 6
# Number of cores to use
N = 6
dynamic_duos = [(a, b) for a in range(5) for b in range(10)]
with mp.Pool(processes = N) as p:
    p.starmap(f, [(dyn, fix1, fix2) for dyn in dynamic_duos])
I would like to control the number of active processes dynamically, because the function sometimes consumes a LOT of RAM. The idea would be to check at every iteration (i.e. before any call of the function f) whether sum(dyn) is below a threshold and whether the amount of available RAM is above a threshold. If both conditions are met, then a new process can start and compute the function.
An additional condition would be the maximum number of processes: the number of cores on the PC.
Thanks for the help :)
Edit: Details on the reasons.
Some of the combinations of parameters will have a high RAM consumption (up to 80 GB in a single process). I know more or less which ones will use a lot of RAM, and when the program encounters them, I would like to wait for the other processes to end, run this high-RAM combination on its own in a single process, and then resume the computation with more processes on the rest of the combinations to map.
Edit: my attempt based on the answer below.
It doesn't work, but it doesn't raise an error; the program simply runs to completion.
# Imports
import itertools
import concurrent.futures
import time
import os

# Parameters
N = int(input("Number of CPUs to use: "))
t0 = 0
tf = 200
s_step = 0.05
max_s = None
folder = "test"
possible_dynamics = [My_class(x) for x in [20, 30, 40, 50, 60]]
dynamics_to_compute = [list(x) for x in itertools.combinations_with_replacement(possible_dynamics, 2)] + [list(x) for x in itertools.combinations_with_replacement(possible_dynamics, 3)]
function_inputs = [(dyn, t0, tf, s_step, max_s, folder) for dyn in dynamics_to_compute]

# -----------
# Computation
# -----------
start = time.time()

# Pool creation and computation
futures = []
pool = concurrent.futures.ProcessPoolExecutor(max_workers = N)
for Obj, t0, tf, s_step, max_s, folder in function_inputs:
    if large_memory(Obj, s_step, max_s):
        concurrent.futures.wait(futures)  # wait for all pending tasks
        large_future = pool.submit(compute, Obj, t0, tf,
                                   s_step, max_s, folder)
        large_future.result()  # wait for large computation to finish
    else:
        future = pool.submit(compute, Obj, t0, tf,
                             s_step, max_s, folder)
        futures.append(future)

end = time.time()
if round(end-start, 3) < 60:
    print("Complete - Elapsed time: {} s".format(round(end-start, 3)))
else:
    print("Complete - Elapsed time: {} mn and {} s".format(int((end-start)//60), round((end-start)%60, 3)))
os.system("pause")
This is still a simplified example of my code, but the idea is here. It runs in less than 0.2 s, which means it never actually calls the function compute.
N.B: Obj is not the actual variable name.
To achieve this, you need to give up on using map in order to gain more control over the execution flow of your tasks.
This code implements the algorithm you described at the end of your question. I'd recommend using the concurrent.futures library, as it exposes a neater set of APIs.
import concurrent.futures

pool = concurrent.futures.ProcessPoolExecutor(max_workers=6)
futures = []

for dyn in dynamic_duos:
    if large_memory(dyn, fix1, fix2):
        concurrent.futures.wait(futures)  # wait for all pending tasks
        large_future = pool.submit(f, dyn, fix1, fix2)
        large_future.result()  # wait for large computation to finish
    else:
        future = pool.submit(f, dyn, fix1, fix2)
        futures.append(future)
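For completeness, here is a hypothetical sketch of what a large_memory predicate could look like for the check described in the question (sum(dyn) below a threshold and enough free RAM), matching the call signature used above. The thresholds and the use of psutil are assumptions for illustration, not part of the original question or answer.
import psutil

SUM_THRESHOLD = 100              # placeholder: depends on what dyn contains
MIN_FREE_BYTES = 80 * 1024**3    # placeholder: e.g. require 80 GB of free RAM

def large_memory(dyn, fix1, fix2):
    # fix1 and fix2 are unused here; they are kept to match the call above.
    free = psutil.virtual_memory().available
    return sum(dyn) > SUM_THRESHOLD or free < MIN_FREE_BYTES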
So I've been messing around with python's multiprocessing lib for the last few days and I really like the processing pool. It's easy to implement and I can visualize a lot of uses. I've done a couple of projects I've heard about before to familiarize myself with it and recently finished a program that brute forces games of hangman.
Anywho, I was doing an execution time comparison of summing all the prime numbers between 1 million and 2 million, both single threaded and through a processing pool. Now, for the hangman cruncher, putting the games in a processing pool improved execution time by about 8 times (i7 with 8 cores), but when grinding out these primes, it actually increased processing time by almost a factor of 4.
Can anyone tell me why this is? Here is the code for anyone interested in looking at or testing it:
#!/user/bin/python.exe
import math
from multiprocessing import Pool

global primes
primes = []

def log(result):
    global primes
    if result:
        primes.append(result[1])

def isPrime( n ):
    if n < 2:
        return False
    if n == 2:
        return True, n

    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True, n

def main():
    global primes

    #pool = Pool()
    for i in range(1000000, 2000000):
        #pool.apply_async(isPrime,(i,), callback = log)
        temp = isPrime(i)
        log(temp)
    #pool.close()
    #pool.join()

    print sum(primes)
    return

if __name__ == "__main__":
    main()
It'll currently run in a single thread; to run it through the processing pool, uncomment the pool statements and comment out the other lines in the main for loop.
The most efficient way to use multiprocessing is to divide the work into n equal-sized chunks, with n the size of the pool, which should be approximately the number of cores on your system. The reason for this is that the work of starting subprocesses and communicating between them is quite large. If the amount of work in each chunk is small, then the overhead of IPC becomes significant.
In your case, you're asking multiprocessing to process each number individually. A better way to deal with the problem is to pass each worker a range of values (probably just a start and end value) and have it return all of the primes it found in that range.
In the case of identifying large-ish primes, the work done grows with the starting value, so you probably don't want to divide the total range into exactly n chunks, but rather n*k equal chunks, with k some reasonable, small number, say 10-100. That way, when some workers finish before others, there's more work left to do and it can be balanced efficiently across all workers.
Edit: Here's an improved example to show what that solution might look like. I've changed as little as possible so you can compare apples to apples.
#!/user/bin/python.exe
import math
from multiprocessing import Pool

global primes
primes = set()

def log(result):
    global primes
    if result:
        # since the result is a batch of primes, we have to use
        # update instead of add (or for a list, extend instead of append)
        primes.update(result)

def isPrime( n ):
    if n < 2:
        return False
    if n == 2:
        return True, n

    max = int(math.ceil(math.sqrt(n)))
    i = 2
    while i <= max:
        if n % i == 0:
            return False
        i += 1
    return True, n

def isPrimeWorker(start, stop):
    """
    find a batch of primes
    """
    primes = set()
    for i in xrange(start, stop):
        if isPrime(i):
            primes.add(i)
    return primes

def main():
    global primes

    pool = Pool()

    # pick an arbitrary chunk size, this will give us 100 different
    # chunks, but another value might be optimal
    step = 10000

    # use xrange instead of range, we don't actually need a list, just
    # the values in that range.
    for i in xrange(1000000, 2000000, step):
        # call the *worker* function with start and stop values.
        pool.apply_async(isPrimeWorker, (i, i+step,), callback = log)

    pool.close()
    pool.join()

    print sum(primes)
    return

if __name__ == "__main__":
    main()