Using the multiprocessing module, it is easy to run a function in parallel with different arguments, like this:
from multiprocessing import Pool

def f(x):
    return x**2

p = Pool(2)
print(p.map(f, [1, 2]))
But I'm interested in executing a list of functions on the same argument. Suppose I have the following two functions:
def f(x):
    return x**2

def g(x):
    return x**3 + 2
How can I execute them in parallel for the same argument (e.g. x=1)?
You can use Pool.apply_async() for that. You bundle up tasks in the form of (function, argument_tuple) and feed every task to apply_async().
from multiprocessing import Pool
from itertools import repeat

def f(x):
    for _ in range(int(50e6)):  # dummy computation
        pass
    return x ** 2

def g(x):
    for _ in range(int(50e6)):  # dummy computation
        pass
    return x ** 3

def parallelize(n_workers, functions, arguments):
    # if you need this multiple times, instantiate the pool outside and
    # pass it in as a dependency to avoid recreating it on every call
    with Pool(n_workers) as pool:
        tasks = zip(functions, repeat(arguments))
        futures = [pool.apply_async(*t) for t in tasks]
        results = [fut.get() for fut in futures]
    return results

if __name__ == '__main__':
    N_WORKERS = 2
    functions = f, g
    results = parallelize(N_WORKERS, functions, arguments=(10,))
    print(results)
Example Output:
[100, 1000]
Process finished with exit code 0
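If you prefer the concurrent.futures API over multiprocessing.Pool, an equivalent sketch (reusing the f and g defined above; run_all is just an illustrative name) could look like this:

from concurrent.futures import ProcessPoolExecutor

def run_all(functions, argument, n_workers=2):
    # one task per function, all sharing the same argument
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(fn, argument) for fn in functions]
        return [fut.result() for fut in futures]

if __name__ == '__main__':
    print(run_all((f, g), 10))  # [100, 1000]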
You can have the function return a tuple. This can be done quite easily and compactly using joblib, which I recommend because it is lightweight:
from joblib import Parallel, delayed
import multiprocessing
import timeit

# Implementation 1
def f(x):
    return x**2, x**3 + 2

# Implementation 2: for a more sophisticated second (or further) function
def g(x):
    return x**3 + 2

def f(x):
    return x**2, g(x)

if __name__ == "__main__":
    inputs = [i for i in range(32)]
    num_cores = multiprocessing.cpu_count()
    t1 = timeit.Timer()
    result = Parallel(n_jobs=num_cores)(delayed(f)(i) for i in inputs)
    print(t1.timeit(1))
Using multiprocessing.Pool, as you already have in the question:
from multiprocessing import Pool, cpu_count
import timeit

def g(x):
    return x**3 + 2

def f(x):
    return x**2, g(x)

if __name__ == "__main__":
    inputs = [i for i in range(32)]
    num_cores = cpu_count()
    p = Pool(num_cores)
    t1 = timeit.Timer()
    result = p.map(f, inputs)
    print(t1.timeit(1))
    print(result)
Example Output:
print(result)
[(0, 2), (1, 3), (4, 10), (9, 29), (16, 66), (25, 127), (36, 218), (49, 345),
(64, 514), (81, 731), (100, 1002), (121, 1333), (144, 1730), (169, 2199),
(196, 2746), (225, 3377), (256, 4098), (289, 4915), (324, 5834), (361, 6861),
(400, 8002), (441, 9263), (484, 10650), (529, 12169), (576, 13826), (625,
15627), (676, 17578), (729, 19685), (784, 21954), (841, 24391), (900, 27002),
(961, 29793)]
print(t1.timeit(1))
5.000001692678779e-07 #(with 16 cpus and 64 Gb RAM)
For inputs = range(2000), it took:
1.100000190490391e-06
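If you specifically want the pattern from the original question (different functions applied to the same argument) with joblib, a minimal sketch, assuming the f and g from the question, would be:

from joblib import Parallel, delayed

def f(x):
    return x**2

def g(x):
    return x**3 + 2

if __name__ == "__main__":
    x = 1
    # one delayed call per function, all receiving the same argument
    results = Parallel(n_jobs=2)(delayed(fn)(x) for fn in (f, g))
    print(results)  # [1, 3]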
Related
Consider the following script:
import numpy as np
import tracemalloc

def zero_mem():
    a = np.zeros((100, 100))

def nonzero_mem():
    b = np.random.randn(100, 100)

if __name__ == "__main__":
    tracemalloc.start()
    zero_mem()
    print(tracemalloc.get_traced_memory())
    tracemalloc.stop()

    tracemalloc.start()
    nonzero_mem()
    print(tracemalloc.get_traced_memory())
    tracemalloc.stop()
The output, running numpy 1.22.2 on Python 3.8.10, is:
(0, 80096)
(72, 80168)
The question is: why isn't the second row (0, 80168)? In other words: why is there memory still in use after nonzero_mem(), unlike when calling zero_mem()?
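Not an answer, but one way to dig into where that leftover allocation comes from is to compare tracemalloc snapshots instead of only looking at the totals; a diagnostic sketch:

import numpy as np
import tracemalloc

def nonzero_mem():
    b = np.random.randn(100, 100)

if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    nonzero_mem()
    after = tracemalloc.take_snapshot()
    # show which source lines still hold traced memory after the call returns
    for stat in after.compare_to(before, 'lineno')[:5]:
        print(stat)
    tracemalloc.stop()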
Here's a timed example of multiple image arrays of different sizes being saved in a loop as well as concurrently using threads / processes:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
from cv2 import cv2

def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    t1 = perf_counter()
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
And I get these durations on my i5 mbp:
Time for 100: 0.09495482999999982 seconds
Time for 100 (ThreadPoolExecutor): 0.14151873999999998 seconds
Time for 100 (ProcessPoolExecutor): 1.5136184309999998 seconds
Time for 1000: 0.36972280300000016 seconds
Time for 1000 (ThreadPoolExecutor): 0.619205703 seconds
Time for 1000 (ProcessPoolExecutor): 2.016624468 seconds
Time for 10000: 4.232915643999999 seconds
Time for 10000 (ThreadPoolExecutor): 7.251599262 seconds
Time for 10000 (ProcessPoolExecutor): 13.963426469999998 seconds
Aren't threads/processes expected to take less time to achieve the same thing? And if so, why not in this case?
The timings in the code are wrong because the timer t is not reset before testing the pools. Nevertheless, the relative order of the timings is correct. A possible version of the code with the timer reset is:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
from cv2 import cv2

def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()  # reset the timer for each executor
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
Multithreading is faster especially for I/O-bound workloads. In this case, compressing the images is CPU-intensive, so depending on the implementation of OpenCV and of the Python wrapper, multithreading can be much slower. In many cases the culprit is CPython's GIL, but I am not sure whether that applies here (I do not know if the GIL is released during the imwrite call). In my setup (i7 8th gen), threading is as fast as the plain loop for 100 images and barely faster for 1000 and 10000 images. If ThreadPoolExecutor reuses threads, there is an overhead involved in assigning a new task to an existing thread; if it does not reuse threads, there is an overhead involved in launching a new thread.
Multiprocessing circumvents the GIL issue but has some other problems. First, pickling the data to pass between processes takes some time, and in the case of images it can be very expensive. Second, on Windows, spawning a new process takes a lot of time. A simple test to see the overhead (both for processes and threads) is to replace the save_img function with one that does nothing but still needs the same pickling, etc.:
def save_img(idx, image, dst):
    if idx != idx:
        print("impossible!")
and with a similar one that takes no parameters, to see the overhead of spawning the processes and threads themselves.
The timings in my setup show that 2.3 seconds are needed just to spawn the 10000 processes and 0.6 extra seconds for pickling, which is much more than the time needed for processing.
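To get a rough feel for the pickling cost on its own (an illustrative sketch only; numbers will vary by machine), you can time pickling the largest batch of images directly:

import pickle
from time import perf_counter

import numpy as np

if __name__ == '__main__':
    images = np.random.randint(0, 255, (10000, 50, 50, 1))
    t = perf_counter()
    payloads = [pickle.dumps(img) for img in images]
    size_mb = sum(len(p) for p in payloads) / 1e6
    print(f'Pickling {len(images)} images took {perf_counter() - t:.2f} s ({size_mb:.1f} MB)')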
A way to improve the throughput while keeping the overhead to a minimum is to break the work into chunks and submit each chunk to a worker:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
from cv2 import cv2

def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

def multi_save_img(idx_start, images, dst):
    for idx, image in zip(range(idx_start, idx_start + len(images)), images):
        cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')

        # split the work into one contiguous chunk per worker
        chunk_size = len(ll) // workers
        ends = [chunk_size * (_ + 1) for _ in range(workers)]
        ends[-1] += len(ll) % workers
        starts = [chunk_size * _ for _ in range(workers)]
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(multi_save_img, start, ll[start:end], temp_dir)
                    for (start, end) in zip(starts, ends)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
This should give you a significant boost over a simple for loop, for both the multiprocessing and the multithreading approach.
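An alternative to the manual chunking, assuming Python 3.5+ where ProcessPoolExecutor.map accepts a chunksize argument, is to let the executor do the batching itself. A sketch reusing the save_img defined above (save_all is just an illustrative helper name):

from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

def save_all(images, dst, workers=4):
    # chunksize batches the pickling/dispatch per worker, much like multi_save_img
    with ProcessPoolExecutor(workers) as ex:
        chunksize = max(1, len(images) // workers)
        list(ex.map(save_img, range(len(images)), images, repeat(dst),
                    chunksize=chunksize))

It would be called as save_all(ll, temp_dir, workers) in place of the executor loop; note that chunksize only has an effect for process pools, not thread pools.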
I am working in a Jupyter notebook. I'm new to multiprocessing in Python, and I'm trying to parallelize the calculation of a function over a grid of parameters. Here is a snippet of code quite representative of what I'm doing:
import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def f(x, y):
    print(os.getpid(), x, y, x + y)
    return x + y

xs = np.linspace(5, 7, 3).astype(int)
ys = np.linspace(1, 3, 3).astype(int)

func = lambda p: f(*p)

with ProcessPoolExecutor() as executor:
    args = (arg for arg in zip(xs, ys))
    results = executor.map(func, args)

for res in results:
    print(res)
The executor doesn't even start.
There is no problem whatsoever if I execute the same thing serially, e.g. with a list comprehension:
args = (arg for arg in zip(xs,ys))
results = [func(arg) for arg in args]
Are you running on Windows? I think your main problem is that each process is trying to re-execute your whole script, so you should include an if __name__ == "__main__" check. I think you have a second issue trying to use a lambda function that can't be pickled, since the processes communicate by pickling the data. There are work-arounds for that, but in this case it looks like you don't really need the lambda. Try something like this:
import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def f(x, y):
    print(os.getpid(), x, y, x + y)
    return x + y

if __name__ == '__main__':
    xs = np.linspace(5, 7, 3).astype(int)
    ys = np.linspace(1, 3, 3).astype(int)

    with ProcessPoolExecutor() as executor:
        results = executor.map(f, xs, ys)

    for res in results:
        print(res)
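If your arguments really do arrive as tuples (for example an existing list of (x, y) pairs), a pickle-friendly alternative to the lambda is a small module-level wrapper. A sketch, assuming the f, xs and ys from above (f_star is an illustrative name):

from concurrent.futures import ProcessPoolExecutor

def f_star(args):
    # top-level (and therefore picklable) wrapper that unpacks an (x, y) tuple
    return f(*args)

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        for res in executor.map(f_star, zip(xs, ys)):
            print(res)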
Suppose I want to plot the density on the x-y plane, where the density is defined as:
def density(x, y):
    return x**2 + y**2
I have many points (x1,y1), (x2,y2), ... to calculate, so I want to do it in parallel. I found the multiprocessing docs and tried the following:
pointsList = [(1, 1), (2, 2), (3, 3)]

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as p:
        print(p.map(density, pointsList))
An error occurs, and it seems that I failed to pass the args to the function. How do I do this?
Edit:
The error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-647-1e2a1f0007fb> in <module>()
5 from multiprocessing import Pool
6 if __name__ == '__main__':
----> 7 with Pool() as p:
8 print(p.map(density,pointsList ))
AttributeError: __exit__
Edit 2:
If I can't do this simple parallelization in Python 2.7, how can I do it in Python 3.5, for instance?
The use of Pool in a context manager was added in Python 3.3. Since you tagged Python 2.7, you can't use the with syntax.
Documentation:
New in version 3.3: Pool objects now support the context management
protocol – see Context Manager Types. __enter__() returns the pool
object, and __exit__() calls terminate().
Here's the working example you wanted, for Python 3.3+:
def density(args):
    x, y = args
    return x**2 + y**2

pointsList = [(1, 1), (2, 2), (3, 3)]

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as p:
        print(p.map(density, pointsList))
And since you're also using Python 2.7, you just need to not use the context manager and call p.terminate() instead:
def density(args):
    x, y = args
    return x**2 + y**2

pointsList = [(1, 1), (2, 2), (3, 3)]

from multiprocessing import Pool

if __name__ == '__main__':
    p = Pool()
    print(p.map(density, pointsList))
    p.terminate()
You need to change the density function to unpack the tuple argument:
def density(z):
    (x, y) = z
    return x**2 + y**2
Try not using the with statement, and close the pool yourself after you are done with it.
This way it should be compatible with both Python 2 and 3:
from multiprocessing import Pool

pointsList = [(1, 1), (2, 2), (3, 3)]

p = Pool()
print(p.map(density, pointsList))
p.close()
Or use the contextlib module:
from multiprocessing import Pool
import contextlib

pointsList = [(1, 1), (2, 2), (3, 3)]

with contextlib.closing(Pool()) as p:
    print(p.map(density, pointsList))
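Putting the two fixes together (tuple unpacking plus explicit pool cleanup), a complete sketch that should run under both Python 2.7 and 3.x might look like:

from multiprocessing import Pool
import contextlib

def density(args):
    # Pool.map passes each item of pointsList as a single argument,
    # so unpack the (x, y) tuple here
    x, y = args
    return x**2 + y**2

pointsList = [(1, 1), (2, 2), (3, 3)]

if __name__ == '__main__':
    with contextlib.closing(Pool()) as p:
        print(p.map(density, pointsList))  # [2, 8, 18]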
In the following code, which is an example of my main code, I have tried to use pathos.multiprocessing to speed up the iterations of a loop. The output of each iteration, which is implemented with multiprocessing, is a 2-D array. I used pathos.multiprocessing instead of multiprocessing because I wanted to use it in a class method. I used the apipe method of pathos.multiprocessing to collect the output in a list, but it returns an empty list. I have no idea why it fails:
import numpy as np
import random
import pathos.multiprocessing as mp

class Testsystematics(object):
    def __init__(self, x, y, NTH=None, THMIN=None, THMAX=None, NRESAMPLE=None):
        self.x = x
        self.y = y
        self.nbins = NTH
        self.bmin = THMIN
        self.bmax = THMAX
        self.nresample = NRESAMPLE
        self.bins = np.linspace(self.bmin, self.bmax, self.nbins + 1, True).astype(np.float)
        self.sample = np.array([[random.choice(range(len(self.y))) for _ in xrange(len(self.y))] for i in range(self.nresample)])
        self.result_list = []

    def log_result(self, result):
        self.result_list.append(result)

    def bootstrapping(self, k):
        xi_p = np.zeros(self.nbins, float)
        xi_m = np.zeros(self.nbins, float)
        nind = np.zeros(self.nbins, float)
        for i in range(len(self.x)):
            for j in range(len(self.x)):
                if (i != j):
                    sep = np.sqrt(self.x[i]**2 + self.x[j]**2)
                    index = np.searchsorted(self.bins, sep, side='right') - 1
                    sind = np.sin(sep)
                    if ((sep < self.bins[-1]) and (sep >= self.bins[0])):
                        xi_p[index] += sind * (np.mean(y) - np.median(y))
                        xi_m[index] += sind * np.std(y)
                        nind[index] += 1.0
        for i in range(self.nbins):
            xi_p[i] = xi_p[i] / nind[i]
            xi_m[i] = xi_m[i] / nind[i]
        return np.vstack((xi_p, xi_m))

    def twopcf(self):
        if (self.sys_type == 1):
            pool = mp.ProcessingPool(16)
            for n in range(self.nresample):
                pool.apipe(self.bootstrapping, args=(n,), callback=self.log_result)

shape, scale = 0.5, 0.6
x = np.random.gamma(shape, scale, 10000)
mu1, sigma1 = 0, 0.5    # mean and standard deviation
mu2, sigma2 = 0.1, 0.7  # mean and standard deviation
y = np.random.normal(mu1, sigma1, 1000) + np.random.normal(mu2, sigma2, 1000)
sysTest = Testsystematics(x, y, NTH=10, THMIN=0, THMAX=5, NRESAMPLE=100)
Any suggestions?
I'm the pathos author. I tried your code, and it runs, but produces no error and produces no result in result_list. I believe that is because you are using apipe incorrectly. The correct use of apipe is as follows:
>>> import pathos
>>> def squared(x):
...     return x**2
...
>>> pool = pathos.multiprocessing.ProcessingPool()
>>> res = pool.apipe(squared, 5)
>>> res.get()
25
self.bootstrapping takes self and k, so you have to provide a k in the pipe call when calling it as an instance method. There is no callback -- if you want a callback, you'd need to add one to your function.
Note that the return value is retrieved by (1) getting a return object, and (2) calling get on that object.
Your use of apipe within a for loop suggests using pool.amap (or pool.imap) instead -- then the for loop itself is done in parallel.
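For example, a minimal sketch of that amap suggestion, adapted to the twopcf method from the question (assuming amap returns an async result whose get() blocks until all results are ready):

def twopcf(self):
    if self.sys_type == 1:
        pool = mp.ProcessingPool(16)
        # run bootstrapping for every resample index in parallel and collect
        # the results directly, instead of relying on a callback
        self.result_list = pool.amap(self.bootstrapping, range(self.nresample)).get()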