For this dask code:
from dask import delayed

def inc(x):
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)
I want to access all the elements in array by executing the delayed tasks. But I can't call array.compute(), since array is a plain Python list rather than a dask object. If I do
for x in range(5):
    array[x].compute()
then does each task get executed in parallel, or does array[1] only get fired after array[0] terminates? Is there a better way to write this code?
You can use the dask.compute function to compute many delayed values at once:
from dask import delayed, compute
array = [delayed(inc)(i) for i in range(5)]
result = compute(*array)
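Here result is a tuple holding the computed values, (1, 2, 3, 4, 5) in this case.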
It's easy to tell if things are executing in parallel if you force them to take a long time. If you run this code:
from time import sleep, time
from dask import delayed

start = time()

def inc(x):
    sleep(1)
    print('[inc(%s): %s]' % (x, time() - start))
    return x + 1

array = [0] * 5
for x in range(5):
    array[x] = delayed(inc)(x)

for x in range(5):
    array[x].compute()
It becomes very obvious that the calls happen in sequence. However if you replace the last loop with this:
delayed(array).compute()
then you can see that they are in parallel. On my machine the output looks like this:
[inc(1): 1.00373506546]
[inc(4): 1.00429320335]
[inc(2): 1.00471806526]
[inc(3): 1.00475406647]
[inc(0): 2.00795912743]
Clearly the first four tasks executed in parallel. Presumably the default parallelism is set to the number of cores on the machine, because for CPU-intensive tasks it's not generally useful to have more.
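If you want to control that explicitly, recent dask versions let you pass scheduler options to compute. A minimal sketch using the inc above (the keyword names are assumed from current dask releases, so check the docs for your version):

from dask import delayed, compute

array = [delayed(inc)(i) for i in range(5)]
# run on the threaded scheduler with an explicit worker count
results = compute(*array, scheduler='threads', num_workers=2)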
I am trying to streamline a program that involves a set of short tasks that can be done in parallel, where the results of the set of tasks must be compared before moving on to the next step (which again involves a set of short tasks, then another set, etc.). Given how simple these tasks are, the set-up time makes multiprocessing not worthwhile. I am wondering if there is another way to do these short tasks in parallel that is faster than linear. The only question I can find on this site that describes this problem for Python references this answer on memory sharing, which I don't think answers my question (or if it does, I could not follow how).
To illustrate what I am hoping to do, consider the problem of summing a bunch of numbers from 0 to N. (Of course this can be solved analytically; my point is to come up with a low-memory but short, CPU-intensive task.) First, the linear approach would simply be:
def numbers(a, b):
    return (i for i in range(a, b))

def linear_sum(a):
    return sum(numbers(a[0], a[1]))

n = 2000
linear_sum([0, n+1])
# 2001000
For threading, I want to break the problem into parts that can then be summed separately and then combined, so the idea would be to get a bunch of ranges over which to sum with something like
import numpy as np

def get_ranges(i, Nprocess=3):
    di = i // Nprocess
    j = np.append(np.arange(0, i, di), [i + 1])
    return [(j[k], j[k+1]) for k in range(len(j) - 1)]
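For concreteness, calling get_ranges(2000) with the default Nprocess = 3 gives four chunks (the appended i+1 adds a short final range), which together cover 0 through 2000:

get_ranges(2000)
# four (start, stop) pairs: (0, 666), (666, 1332), (1332, 1998), (1998, 2001)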
and for some value n >> Nprocess the pseudocode would be something like
values = get_ranges(n)
x = []
for value in values:
    x.append(do_something_parallel(value))
return sum(x)
The question, then, is how to implement do_something_parallel. For multiprocessing, we can do something like:
# note: this is the process-based Pool, just imported under the name ThreadPool
from multiprocessing import Pool as ThreadPool

def mpc_thread_sum(i, Nprocess=3):
    values = get_ranges(i)
    pool = ThreadPool(Nprocess)
    results = pool.map(linear_sum, values)
    pool.close()
    pool.join()
    return sum(results)

print(mpc_thread_sum(2000))
# 2001000
The graph below shows the performance of the different approaches described. Is there a way to speed up computations in the region where multiprocessing is still slower than linear, or is this the limit of parallelization under Python's GIL? I suspect the answer may be that I am hitting my limit, but I wanted to ask here to be sure. I tried multiprocessing.dummy, asyncio, threading, and ThreadPoolExecutor (from concurrent.futures). For brevity I have omitted the code, but all show execution times comparable to the linear approach. All are designed for I/O tasks, so they are constrained by the GIL.
My first observation is that the running time of function numbers can be cut roughly in half by simply defining it as:
def numbers(a, b):
return range(a, b)
Second, a 100% CPU-intensive task such as computing the sum of numbers can never perform significantly better with threads in pure Python, without the aid of a C-language runtime library (such as numpy), because of contention for the Global Interpreter Lock (GIL), which prevents any sort of parallelization from occurring (and asyncio only uses a single thread to begin with).
Third, the only way you can achieve a performance improvement running pure Python code against a 100% CPU task is with multiprocessing. But there is CPU overhead in creating the process pool, overhead in passing arguments from the main process to the address space in which the process pool's processes are running, and overhead again in passing back the results. So for there to be any performance improvement, the worker function linear_sum cannot be trivial; it must require enough CPU processing to warrant the additional overhead I just mentioned.
The following benchmark runs the worker function, renamed to compute_sum, which now accepts a range as its argument. To further reduce overhead, I have introduced a function split that takes the passed range argument and generates multiple range instances, removing the need to use numpy and to generate arrays. The benchmark computes the sum using a single thread (linear), a multithreading pool and a multiprocessing pool, and is run for both n = 2000 and n = 50_000_000. It displays the elapsed time and the total CPU time across all processes.
For n = 2000, multiprocessing, as expected, performs worse than both linear and multithreading. For n = 50_000_000, multiprocessing's total CPU time is a bit higher than for linear and multithreading, as expected due to the aforementioned additional overhead, but the elapsed time has gone down significantly. For both values of n, multithreading is a loser.
from multiprocessing.pool import Pool, ThreadPool
import time

def split(iterable, n):
    # split a sliceable (here a range) into n roughly equal pieces
    k, m = divmod(len(iterable), n)
    return (iterable[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def compute_sum(r):
    t = time.process_time()
    return (sum(r), time.process_time() - t)

if __name__ == '__main__':
    for n in (2000, 50_000_000):
        r = range(0, n+1)

        # linear (single thread)
        t1 = time.time()
        s, cpu = compute_sum(r)
        elapsed = time.time() - t1
        print(f'n = {n}, linear elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')

        # multithreading pool
        t1 = time.time()
        t2 = time.process_time()
        thread_pool = ThreadPool(4)
        s = 0
        for return_value, process_time in thread_pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
        elapsed = time.time() - t1
        cpu = time.process_time() - t2
        print(f'n = {n}, thread pool elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        thread_pool.close()
        thread_pool.join()

        # multiprocessing pool
        t1 = time.time()
        t2 = time.process_time()
        pool = Pool(4)
        s = 0
        cpu = 0
        for return_value, process_time in pool.imap_unordered(compute_sum, split(r, 4)):
            s += return_value
            cpu += process_time
        elapsed = time.time() - t1
        cpu += time.process_time() - t2
        print(f'n = {n}, multiprocessing elapsed time = {elapsed}, total cpu time = {cpu}, sum = {s}')
        pool.close()
        pool.join()
        print()
Prints:
n = 2000, linear elapsed time = 0.0, total cpu time = 0.0, sum = 2001000
n = 2000, thread pool elapsed time = 0.00700068473815918, total cpu time = 0.015625, sum = 2001000
n = 2000, multiprocessing elapsed time = 0.13200139999389648, total cpu time = 0.015625, sum = 2001000
n = 50000000, linear elapsed time = 2.0311124324798584, total cpu time = 2.03125, sum = 1250000025000000
n = 50000000, thread pool elapsed time = 2.050999164581299, total cpu time = 2.046875, sum = 1250000025000000
n = 50000000, multiprocessing elapsed time = 0.7579991817474365, total cpu time = 2.359375, sum = 1250000025000000
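As a sanity check, the sum of 0 through n is n(n + 1)/2, so for n = 50,000,000 the expected result is 50,000,000 × 50,000,001 / 2 = 1,250,000,025,000,000, which matches the sums printed above.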
Disclaimer: I have gone through loads of multiprocessing answers on SO as well as the documentation, and either the questions were really old (Python 3.x has made tons of improvements since) or I did not find a clear answer. If I have missed something relevant, do point me in the right direction.
I started with a simple function, defined as below in a separate module, since I am running in a Jupyter Notebook and it seems that, due to conflicts, you can only run multiprocessing on an imported function:
def f(a):
    return a * 100
Built some test data and ran some tests:
from itertools import zip_longest
from multiprocessing import Process, Pool, Array, Queue
from time import time
from modules.test import *
li = [i for i in range(1000000)]
List comprehension: Really Fast
start = time()
tests = [f(i) for i in li]
print(f'Total time {time() - start} s')
>> Total time 0.154066801071167 s
The answer from an SO example here: about 11 seconds
start = time()
results = []
if __name__ == '__main__':
    jobs = 4
    size = len(li)
    heads = list(range(size//jobs, size, size//jobs)) + [size]
    tails = range(0, size, size//jobs)
    pool = Pool(4)
    for tail, head in zip(tails, heads):
        r = pool.apply_async(f, args=(li[tail:head],))
        results.append(r)
    pool.close()
    pool.join()  # wait for the pool to be done
print(f'Total time {time() - start} s')

>> Total time 11.087551593780518 s
And there is Process, which I do not know whether it would be applicable to the example above. I am unfamiliar with multiprocessing, but I do understand that there is some overhead in creating new instances and whatnot; as the data grows larger, though, that overhead should be justified.
My question is: with current Python 3.x performance, is using multiprocessing for operations similar to the above still relevant, or even something one should attempt? And if it is, how should it be applied to parallelize the workload?
Most of the examples I have read and understood are for web scraping, where one process has actual idle time while receiving information, so it makes sense to parallelize. But how would you approach it if you are running computations over something like a list or a dictionary?
The reason your example is not performing well is because you are doing two totally different things.
In your list comprehension, you are mapping f onto each element of li.
In the second case, you are splitting your li list into jobs chunks and then applying your function to each of those chunks (jobs calls in total). And now, in f, a * 100 takes a chunk about a quarter the size of your original list and "multiplies" it by 100, i.e. it uses the sequence-repetition operator, so it creates a new list 100 times the size of the chunk:
>>> chunk = [1,2,3]
>>> chunk * 10
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>>
So basically, you are comparing apples to oranges.
However, multiprocessing already comes with an out-of-the-box mapping utility. Here is a better comparison, a script called foo.py:
import time
import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
Now, here's some actual performance results:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 0.14193987846374512 seconds
Pool.map took 0.2513458728790283 seconds
(py37) Juans-MBP:test_mp juan$
For this very trivial function, the cost of the inter-process communication will always be higher than the cost of calculating the function serially. So you won't see any gains from multiprocessing. However, a much less trivial function can see gains from multiprocessing.
Here's a simple example: I just sleep for a microsecond before multiplying:
import time
import multiprocessing as mp

def f(x):
    time.sleep(0.000001)
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
And now, you see gains commensurate with the number of processes:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 13.175776720046997 seconds
Pool.map took 3.1484851837158203 seconds
Note, on my machine, a single multiplication takes orders of magnitude less time than a microsecond (about 10 nanoseconds):
>>> import timeit
>>> timeit.timeit('100*3', number=int(1e6))*1e-6
1.1292944999993892e-08
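One more knob worth knowing about for cheap functions like this: the pool methods accept a chunksize argument that controls how many items are shipped per inter-process round trip. Pool.map picks a reasonable chunk size automatically, but imap and imap_unordered default to one item at a time, which is where the per-item IPC overhead hurts most. A minimal sketch (the batch size of 10000 is an arbitrary choice, not a recommendation):

import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))
    with mp.Pool(4) as pool:
        # batching items per round trip reduces pickling and queue overhead
        result = list(pool.imap(f, data, chunksize=10000))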
I want to do something in parallel, but it always runs slower. Below are two code snippets that can be compared. The multiprocessing way needs 12 seconds on my laptop, the sequential way only 3 seconds. I thought multiprocessing was supposed to be faster.
I know that the task itself does not make much sense; it is only there to compare the two approaches. I also know that bubble sort can be replaced by faster algorithms.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return [myset[i] for i in sorted_list]

def bubbleSort(iterator, alist, return_dictionary):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return_dictionary[iterator] = sample_list

if __name__ == '__main__':
    manager = Manager()
    return_dictionary = manager.dict()
    jobs = []
    for i in range(3000):
        p = Process(target=bubbleSort, args=(i, myArray, return_dictionary))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print return_dictionary.values()
The other way:
import os
import random

myArray = []
for i in range(1000):
    myArray.append(random.randint(1, 1000))

def getRandomSample(myset, sample_size):
    sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
    return [myset[i] for i in sorted_list]

def bubbleSort(alist):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return sample_list

if __name__ == '__main__':
    results = []
    for i in range(3000):
        results.append(bubbleSort(myArray))
    print results
Multiprocessing is faster if you have multiple cores and do the parallelization properly. In your example you create 3000 processes, which causes an enormous amount of context switching between them. Instead, use a Pool to schedule the jobs across a fixed set of processes:
from multiprocessing import Pool

def bubbleSort(alist):
    sample_list = getRandomSample(alist, 100)
    for passnum in range(len(sample_list)-1, 0, -1):
        for i in range(passnum):
            if sample_list[i] > alist[i+1]:
                temp = alist[i]
                sample_list[i] = alist[i+1]
                sample_list[i+1] = temp
    return sample_list

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
        pass
I removed all the output and did some tests on my 4-core machine. As expected, the code above was about 4 times faster than your sequential example.
Multiprocessing is not just magically faster. The thing is that your computer still has to do the same amount of work. It's like when you try to do multiple tasks at once yourself: it's not going to be faster.
In a "normal" program, doing it sequentially is easier to read and write (that it is that much faster too surprises me a little). Multiprocessing is especially useful if you have to wait for another process, like a web request (you can send multiple at once and don't have to wait for each one), or if you have some sort of event loop.
My guess as to why the sequential version is faster is that Python already uses multiprocessing internally wherever it makes sense (don't quote me on that). Also, with threading it has to keep track of what is where, which means more overhead.
So, going back to the real-world analogy: if you give a task to somebody else and, instead of waiting for it, you do other things at the same time as them, then you are faster.
I wrote the following code, which calls the function compute_cluster 6 times in parallel (each run of this function is independent of the others, and each run writes its results to a separate file):
global L
for L in range(6, 24):
    pool = Pool(6)
    pool.map(compute_cluster, range(1, 3))
    pool.close()

if __name__ == "__main__":
    main(sys.argv)
Despite the fact that I'm running this code on an i7 machine, and no matter what I set the Pool size to, it only ever runs two processes in parallel. Is there any suggestion on how I can run 6 processes in parallel, such that the first three processes use L=6 and call compute_cluster with parameter values from 1:3 in parallel, while at the same time the other three processes run the same function with the same parameter values but with the global L value set to 7?
Any suggestions are highly appreciated.
There are a few things wrong here. First, as to why you only ever have 2 processes going at a time: range(1, 3) only returns 2 values, so you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case, the code probably works, but it's limiting your performance since it's the factor preventing you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like this¹:
def wrapper(tup):
    l, r = tup
    # Even better would be to get rid of `L` and pass it to compute_cluster
    global L
    L = l
    compute_cluster(r)

for r in range(1, 3):
    p = Pool(6)
    p.map(wrapper, [(l, r) for l in range(6, 24)])
    p.close()
This works with the global L because each spawned process picks up its own copy of L -- It doesn't get shared between processes.
¹ Untested code
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()
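If compute_cluster itself can be changed (a hypothetical variant, since its definition isn't shown here), an even cleaner option on Python 3 is to pass L in as a parameter and let starmap do the unpacking, removing the global entirely:

from itertools import product
from multiprocessing import Pool

def compute_cluster(L, r):
    ...  # hypothetical signature with L passed explicitly

if __name__ == '__main__':
    p = Pool(6)
    p.starmap(compute_cluster, product(range(6, 24), range(1, 3)))
    p.close()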
I'm trying to evaluate a chi-squared function, i.e. compare an arbitrary (black-box) function to a numpy array of data. At the moment I'm looping over the array in Python, but something like this is very slow:
n = len(array)
sigma = 1.0
chisq = 0.0
for i in range(n):
    data = array[i]
    model = f(i, a, b, c)
    chisq += 0.5*((data - model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec = numpy.zeros(n)
for i in range(n):
    model_vec[i] = f(i, a, b, c)
return numpy.sum(((array - model_vec)/sigma)**2.0)
?
Thanks!
You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
Notes: The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
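For completeness, a sketch of what the np.vectorize route would look like for the loop in the question (array, a, b, c, sigma and f as defined there):

import numpy as np

g = np.vectorize(f)                              # wraps the scalar f(i, a, b, c)
model_vec = g(np.arange(len(array)), a, b, c)    # scalar a, b, c broadcast automatically
chisq = np.sum(0.5*((array - model_vec)/sigma)**2.0)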
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b

def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a, b, x)[i] if you only want the i-th element, but for operations on the entire array, use g(a, b, x) and it will be much faster.
model_vec = g(a, b, x)
return numpy.sum(((array-model_vec)/sigma)**2.0)
It seems that your code is slow because what is executing in the loop is slow (your model generation). Turning this into a one-liner won't speed things up. If you have access to a modern computer with more than one CPU, you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool

if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = [(i, a, b, c) for i in range(n)]
    # f takes separate arguments, so unpack each tuple with starmap
    model_array = pool.starmap(f, inputs)
    for i in range(n):
        data = array[i]
        model = model_array[i]
        chisq += 0.5*((data - model)/sigma)**2.0