skimage.util.apply_parallel not behaving as expected - python

I'm trying to tile an image and apply a calculation to each tile in parallel, but it's not behaving as I expect. I made the function print 'here' whenever it is executed, and many lines are printed quickly, which suggests the calls are being launched simultaneously. But the CPU load in my task manager never gets above 100%, and the run takes a long time. Can someone please advise? This is my first time using skimage.util.apply_parallel, which uses Dask.
from numpy import random, ones, zeros_like
from skimage.util import apply_parallel
def f3(im):
    print('here')
    for _ in range(10000):
        u=random.random(100000)
    return zeros_like(im)
if __name__=='__main__':
    im=ones((2,4))
    f = lambda img: f3(img)
    im2=apply_parallel(f,im,chunks=1)

I looked at the source code, and apply_parallel relies on this Dask command:
res = darr.map_overlap(wrapped_func, depth, boundary=mode, dtype=dtype)
But I found that it needs .compute(scheduler='processes') at the end to guarantee that multiple CPUs are used. So now I'm just using Dask itself:
import dask.array as da
im2 = da.from_array(im,chunks=2)
proc = im2.map_overlap(f, depth=0).compute(scheduler='processes')
Then the cpu usage really jumps!

Related

Avoiding IO time delay in a loop using multiprocessing

I am running predictions with a trained TensorFlow model on images coming from a simulator, and generating data from them. The issue is that I also need to save the image for each prediction, which adds a delay in the loop that sometimes causes problems in the simulator. Is there a way to use Python's multiprocessing module to build a producer-consumer architecture that avoids the I/O cost in the loop?
for data in data_arr:
    speed=float(data['speed'])
    image=Image.open(BytesIO(base64.b64decode(data['image'])))
    image=np.asarray(image)
    img_c=image.copy()
    image=img_preprocess(image)
    image=np.array([image])
    steering_angle=float(model_steer.predict(image))
    #throttle=float(model_thr.predict(image))
    throttle=1.0-speed/speed_limit
    save_image(img_c,steering_angle)
    print('{} {} {}'.format(steering_angle,throttle,speed))
    send_control(steering_angle,throttle)
I tried to experiment with a similar concept, converting images from color to grayscale, but instead of decreasing, the total time increased from 0.1 s to 17 s.
import numpy as np
import cv2
import os
import time
from multiprocessing import Pool,RawArray
import ctypes
files_path=os.listdir('./imgs/')
files_path=list(map(lambda x:'./imgs/'+x,files_path))
temp_img=np.zeros((160,320))
var_dict = {}
def init_worker(X, h,w):
    # Using a dictionary is not strictly necessary. You can also
    # use global variables.
    var_dict['X']=X
    var_dict['h'] = h
    var_dict['w'] = w
def worker_func(idx):
    # Rebuilds the image from the shared buffer and writes it to disk
    X_np = np.frombuffer(var_dict['X'], dtype=np.uint8)
    X_np=X_np.reshape(var_dict['h'],var_dict['w'])
    cv2.imwrite('./out/'+str(idx)+'.jpg',X_np)
if __name__=='__main__':
    start_time=time.time()
    for idx,filepath in enumerate(files_path):
        img=cv2.imread(filepath)
        img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        h,w=img.shape[:2]
        mulproc_array=RawArray(ctypes.c_uint8,160*320)
        X_np = np.frombuffer(mulproc_array, dtype=np.uint8).reshape(160,320)
        np.copyto(X_np,img)
        #cv2.imwrite('./out/'+str(idx)+'.jpg',img)
        with Pool(processes=1, initializer=init_worker, initargs=(mulproc_array, h,w)) as pool:
            pool.map(worker_func,[idx])
    end_time=time.time()
    print('Time taken=',(end_time-start_time))
There is no reason to use RawArray here: multiprocessing already pickles objects to transfer them, and the pickled data is approximately the same size as the NumPy array. RawArray is meant for a different use case than yours.
You don't need to wait for the saving function to finish; you can run it asynchronously.
You shouldn't close the pool until you are done with everything, because creating a worker takes a long time (on the order of 10-100 ms).
def worker_func(img,idx):
    cv2.imwrite('./out/'+str(idx)+'.jpg',img)
if __name__=='__main__':
    start_time=time.time()
    with Pool(processes=1) as pool:
        results = []
        for idx,filepath in enumerate(files_path):
            img=cv2.imread(filepath)
            img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) # do other work here
            # next line converts image to uint8 before sending it to reduce its size
            results.append(pool.apply_async(worker_func,args=(img.astype(np.uint8),idx)))
        end_time=time.time() # technically the transfer is done, at this line.
        for res in results:
            res.get() # call this before closing the pool to make sure all images are saved.
    print('Time taken=',(end_time-start_time))
You might want to experiment with threading instead of multiprocessing to avoid the data copy altogether, since writing to disk releases the GIL, but the results are not guaranteed to be faster.
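As a rough sketch of the threading idea above, the loop can hand each image to a background writer thread through a queue.Queue, so the main loop does not wait on the disk. The names start_writer, stop_writer and save_fn are hypothetical, and save_fn stands in for something like cv2.imwrite; this is an illustration under those assumptions, not a tuned implementation.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the writer thread to exit


def start_writer(save_fn, maxsize=64):
    """Start a background thread that calls save_fn(*item) for each queued item.

    save_fn is a placeholder for the real I/O call (for example an image write).
    Returns the queue to put work on and the thread to join at shutdown.
    """
    q = queue.Queue(maxsize=maxsize)  # bounded, so memory use stays limited

    def _run():
        while True:
            item = q.get()
            if item is _STOP:
                break
            save_fn(*item)

    t = threading.Thread(target=_run, daemon=True)
    t.start()
    return q, t


def stop_writer(q, t):
    """Ask the writer to finish the queued items, then wait for it."""
    q.put(_STOP)
    t.join()


# Demo with an in-memory stand-in for the disk write.
saved = []
q, t = start_writer(lambda img, name: saved.append((name, img)))
for idx in range(5):
    frame = [idx] * 4             # stand-in for an image array
    q.put((frame, f"{idx}.jpg"))  # returns immediately unless the queue is full
stop_writer(q, t)
```

Because there is a single consumer, items are saved in the order they were queued. Calling stop_writer before the program exits plays the same role as res.get() in the pool version: it makes sure every image has been written.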

Fastest way to run a single function in python in parallel for multiple parameters

Suppose I have a single function, processing. I want to run the same function multiple times with different parameters in parallel instead of sequentially, one after the other.
def processing(image_location):
    image = rasterio.open(image_location)
    ...
    ...
    return(result)
#calling function serially one after the other with different parameters and saving the results to a variable.
results1 = processing(r'/home/test/image_1.tif')
results2 = processing(r'/home/test/image_2.tif')
results3 = processing(r'/home/test/image_3.tif')
For example, if I run processing(r'/home/test/image_1.tif'), then processing(r'/home/test/image_2.tif'), and then processing(r'/home/test/image_3.tif'), as shown in the code above, the calls run sequentially. If one call takes 5 minutes, running all three takes 5x3=15 minutes. Hence, I am wondering whether I can run these three calls in parallel (the problem is embarrassingly parallel) so that processing all three parameters takes only about 5 minutes.
Please help me find the fastest way to do this job. The script should use all the available resources (CPU, RAM) by default.
You can use multiprocessing.pool.ThreadPool to run the calls concurrently and collect the results in the results variable. Note that ThreadPool uses threads rather than processes, so pure-Python CPU-bound code is still limited by the GIL:
from multiprocessing.pool import ThreadPool
pool = ThreadPool()
images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
results = pool.map(processing, images)
You might want to take a look at IPython Parallel. It allows you to easily run functions on a load-balanced (local) cluster.
For this little example, make sure you have IPython Parallel, NumPy and Pillow installed. To run the example, you first need to launch the cluster. To launch a local cluster with four parallel engines (one engine per processor core seems a reasonable choice), type into a terminal:
ipcluster start -n 4
Then you can run the following script, which searches for jpg-images in a given directory and counts the number of pixels in each image:
import ipyparallel as ipp
rc = ipp.Client()
with rc[:].sync_imports(): # import on all engines
    import numpy
    from pathlib import Path
    from PIL import Image
lview = rc.load_balanced_view() # default load-balanced view
lview.block = True # block until map() is finished
@lview.parallel()
def count_pixels(fn: Path):
    """Silly function to count the number of pixels in an image file"""
    im = Image.open(fn)
    xx = numpy.asarray(im)
    num_pixels = xx.shape[0] * xx.shape[1]
    return fn.stem, num_pixels
pic_dir = Path('Pictures')
fn_lst = pic_dir.glob('*.jpg') # list all jpg-files in pic_dir
results = count_pixels.map(fn_lst) # execute in parallel
for n_, cnt in results:
    print(f"'{n_}' has {cnt} pixels.")
Another way of writing this with the multiprocessing library (see @Alderven's answer for a different function).
import multiprocessing as mp
import numpy as np
def calculate(input_args):
    result = input_args * 2
    return result
N = mp.cpu_count()
parallel_input = np.arange(0, 100)
print('Amount of CPUs ', N)
print('Amount of iterations ', len(parallel_input))
with mp.Pool(processes=N) as p:
    results = p.map(calculate, list(parallel_input))
The results variable will contain a list with your processed data, which you can then write out.
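The same pattern can also be written with the standard-library concurrent.futures module. The sketch below is only an illustration: run_all and fake_processing are hypothetical names, and fake_processing merely stands in for the real processing function. Threads are used by default so the snippet runs as-is; passing ProcessPoolExecutor as executor_cls (from a script guarded by if __name__ == '__main__') is the variant to try for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_processing(image_location):
    """Hypothetical stand-in for the real per-image function."""
    return len(image_location)


def run_all(func, items, executor_cls=ThreadPoolExecutor, max_workers=None):
    """Apply func to every item concurrently and return results in input order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


images = ['/home/test/image_1.tif', '/home/test/image_2.tif', '/home/test/image_3.tif']
results = run_all(fake_processing, images)
```

executor.map keeps the results in the same order as the inputs, so results[0] belongs to image_1.tif even if another call finishes first.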
I think one of the easiest methods is using joblib:
import joblib
allJobs = []
allJobs.append(joblib.delayed(processing)(r'/home/test/image_1.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_2.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_3.tif'))
results = joblib.Parallel(n_jobs=joblib.cpu_count(), verbose=10)(allJobs)

Mismatch between parallelized and linear nested for loops

I want to parallelize a piece of code that resembles the following:
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
so I followed the advice here and I rewrote it as follows:
toavg=[]
gal=[]
p = mp.Pool()
def deltaz(params):
    j=params[0] # index of the galaxy
    m=params[1] # indices for which we have sampled redshifts
    gal.append(samples[m][j]-zt[j])
    return np.mean(gal)
j=(np.linspace(0,Ngal-1,Ngal).astype(int))
m=sampind
grid=[j,m]
input=itertools.product(*grid)
results = p.map(deltaz,input)
accuracy=np.mean(results)
p.close()
p.join()
But the results are not the same. In fact, sometimes they are and sometimes they're not; the behavior doesn't seem deterministic. Is my approach correct? If not, what should I fix? The modules needed to reproduce the above examples are:
import numpy as np
import multiprocess as mp
import itertools
Thank you!
The first issue I see is that you are creating a global variable gal that is accessed by the function deltaz. Globals are not shared between the pool processes; each process gets its own separate copy. You would have to use shared memory if you wanted them to share this structure. This is probably why you see non-deterministic behavior.
The next issue is that the two versions do not actually compute the same thing. The serial version takes the average of each per-galaxy list (gal) and then averages those. The parallel version takes an average of whichever elements happen to have ended up in that process's list. This is nondeterministic because items are assigned to processes as they become available, and that assignment is not necessarily predictable.
I would suggest giving each worker one galaxy and letting it run the whole inner loop. To do this, zt and samples both need to be in shared memory because they are accessed by all of the processes. This can get dangerous if you are modifying data, but since you appear to only be reading it, it should be fine.
import numpy as np
import multiprocessing as mp
import itertools
import ctypes
#Non-parallel code
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
#Nonparallel
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
print(toavg)
# Parallel function
def deltaz(j):
    sampind=[7,16,22,31,45]
    gal = []
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    return np.mean(gal)
# Shared array for zt
zt_base = mp.Array(ctypes.c_double, int(len(zt)),lock=False)
ztArr = np.ctypeslib.as_array(zt_base)
#Shared array for samples
sample_base = mp.Array(ctypes.c_double, int(np.prod(samples.shape)),lock=False)
sampArr = np.ctypeslib.as_array(sample_base)
sampArr = sampArr.reshape(samples.shape)
#Copy arrays to shared
sampArr[:,:] = samples[:,:]
ztArr[:] = zt[:]
with mp.Pool() as p:
    result = p.map(deltaz,(np.linspace(0,Ngal-1,Ngal).astype(int)))
print(result)
Here is an example that produces the same results. You can add more complexity to this as you see fit but I would read about multiprocessing in general and memory types/scopes to get an idea of what will and won't work. You have to take more care when you get into the multiprocessing world. Let me know if this doesn't help and I will try to update it so that it does.
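For arrays this small, it may also be worth checking whether a process pool is needed at all. As a hedged sketch (not part of the answer above), the double loop can be expressed with NumPy indexing, which gives the same per-galaxy averages deterministically; toavg_vec is a name introduced here for illustration.

```python
import numpy as np

Ngal = 10
sampind = [7, 16, 22, 31, 45]
samples = 0.3 * np.ones((60, Ngal))
zt = np.array([2.15, 7.16, 1.23, 3.05, 4.1, 2.09, 1.324, 3.112, 0.032, 0.2356])

# Reference: the original nested loops.
toavg = []
for j in range(Ngal):
    gal = [samples[m][j] - zt[j] for m in sampind]
    toavg.append(np.mean(gal))

# Vectorized: pick the sampled rows, subtract zt from every row
# (broadcast over columns), then average down each column.
toavg_vec = (samples[sampind, :] - zt).mean(axis=0)
accuracy = toavg_vec.mean()
```

Comparing toavg_vec against the loop result with np.allclose is a quick way to confirm that any parallel rewrite still computes the same quantity.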

Why this multiprocessing code is slower than the serial one?

I tried the following python programs, both sequential and parallel versions on a cluster computing facility. I could clearly see(using top command) more processes initiating for the parallel program. But when I time it, it seems the parallel version is taking more time. What could be the reason? I am attaching the codes and the timing info herewith.
#parallel.py
from multiprocessing import Pool
import numpy
def sqrt(x):
    return numpy.sqrt(x)
pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
#seq.py
import numpy
def sqrt(x):
    return numpy.sqrt(x)
results = [sqrt(x) for x in range(100000)]
user@domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user@domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is far too small to compensate for the work-distribution overhead. First, you should increase the chunksize, but even then a single square-root operation is too short to compensate for the cost of sending the data between processes. You can see an effective speedup with something like this:
def sqrt(x):
    for _ in range(100):
        x = numpy.sqrt(x)
    return x
results = pool.map(sqrt, range(10000), chunksize=100)
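To make the overhead argument concrete, here is a minimal sketch of batching: each task receives a whole range of inputs instead of a single number, so the per-task cost is paid once per batch. make_chunks, sqrt_chunk and batched_sqrt are hypothetical helpers; the executor is a thread pool by default so the snippet runs anywhere, and a ProcessPoolExecutor can be passed in instead when real CPU parallelism is wanted.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def make_chunks(n, chunk_size):
    """Split range(n) into consecutive (start, stop) pairs."""
    return [(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]


def sqrt_chunk(bounds):
    """Compute square roots for one whole chunk inside a single task."""
    start, stop = bounds
    return [math.sqrt(x) for x in range(start, stop)]


def batched_sqrt(n, chunk_size, executor_cls=ThreadPoolExecutor):
    """Map sqrt over range(n), sending one chunk per task."""
    with executor_cls() as executor:
        parts = executor.map(sqrt_chunk, make_chunks(n, chunk_size))
        # Flatten the per-chunk lists back into one list, in input order.
        return [y for part in parts for y in part]


results = batched_sqrt(1000, chunk_size=250)
```

With chunk_size=250 only four tasks are dispatched for 1000 inputs, which is the same effect the chunksize argument of pool.map aims for.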

Parfor for Python

I am looking for a definitive Python (SciPy, NumPy) equivalent of MATLAB's parfor.
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical numerical computation code that I need speeding up
import numpy as np
N = 2000
output = np.zeros([N,N])
for i in range(N):
    for j in range(N):
        output[i,j] = HeavyComputationThatIsThreadSafe(i,j)
An example of a heavy computation function is:
import scipy.optimize
def HeavyComputationThatIsThreadSafe(i,j):
    n = i * j
    return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]
The one built into Python is multiprocessing (see its documentation). I always use multiprocessing.Pool with as many workers as processors. Then, whenever I need a for-loop-like structure, I use Pool.imap.
As long as the body of your function does not depend on any previous iteration, you should get near-linear speed-up. This also requires that your inputs and outputs are picklable, but that is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product
output = np.zeros((N,N))
pool = Pool() #defaults to number of available CPU's
chunksize = 20 #this may take some guessing ... take a look at the docs to decide
# Fun must accept a single (i, j) tuple; chunksize is an argument of imap, not of enumerate
for ind, res in enumerate(pool.imap(Fun, product(range(N), range(N)), chunksize)):
    output.flat[ind] = res
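The loop above can be wrapped in a small helper. The following is a sketch under assumptions: parfor_grid and cell are hypothetical names, the worker takes one (i, j) tuple, and a thread pool from concurrent.futures is used by default so that the snippet runs as-is (a ProcessPoolExecutor can be passed in for CPU-bound work). It also shows how to put results in the right cell when they arrive out of order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

import numpy as np


def cell(ij):
    """Hypothetical stand-in for the heavy computation; takes one (i, j) tuple."""
    i, j = ij
    return i * j


def parfor_grid(func, n, executor_cls=ThreadPoolExecutor):
    """Evaluate func((i, j)) for every cell of an n x n grid, in any order."""
    output = np.zeros((n, n))
    with executor_cls() as executor:
        # Remember which cell each future belongs to, because
        # as_completed yields futures in completion order, not input order.
        futures = {executor.submit(func, ij): ij for ij in product(range(n), range(n))}
        for fut in as_completed(futures):
            output[futures[fut]] = fut.result()
    return output


output = parfor_grid(cell, 4)
```

Submitting one future per cell is only sensible when each cell is expensive; for cheap cells, grouping whole rows into one task keeps the overhead down.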
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
Jupyter Notebook
As an example, suppose you want to write the equivalent of this MATLAB code in Python:
matlabpool open 4
parfor n=0:9
    for i=1:10000
        for j=1:10000
            s=j*i
        end
    end
    n
end
disp('done')
Here is one way to write this in Python, particularly in a Jupyter notebook. You have to create a module in the working directory (I called it FunForParFor.py) that contains the following:
def func(n):
    for i in range(10000):
        for j in range(10000):
            s=j*i
    print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()
    print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray
ray.init()
# Define the function. Each remote function will be executed
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i*j
    time.sleep(0.5) # Simulate some heavy computation.
    return n
N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e, an identifier to the
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous finished, which
        # leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))
# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)
# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, so the remote function's computation should take longer than the overhead of invoking it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize the scheduling and startup overhead.
I've always used Parallel Python, but it's not a complete analog since I believe it typically uses separate processes, which can be expensive on certain operating systems. Still, if the bodies of your loops are chunky enough, this won't matter and can actually have some benefits.
I tried all the solutions here, but found that the simplest way, and the closest equivalent to MATLAB's parfor, is Numba's prange.
Essentially you change a single letter in your loop, range to prange:
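from numba import njit, prange
# njit(parallel=True) replaces the older autojit decorator, which current Numba no longer provides
@njit(parallel=True)
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum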
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
Instead of writing a for loop
from time import sleep
for _ in range(10):
    sleep(.2)
rewrite your operation as a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not evaluate the expression directly, but collect what should be done.
This is what the delayed function is for.
from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]
Next, instantiate Parallel with n_jobs workers and process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
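To see why delayed(sleep)(.2) works while delayed(sleep(.2)) does not, it helps to picture delayed as a wrapper that records a call instead of making it. The toy my_delayed below is a simplified stand-in written for this explanation, not joblib's actual code; run_serial is likewise hypothetical and simply executes the recorded calls one by one.

```python
import functools


def my_delayed(func):
    """Toy version of the idea: capture (func, args, kwargs) without calling func."""
    @functools.wraps(func)
    def capture(*args, **kwargs):
        return func, args, kwargs
    return capture


def run_serial(tasks):
    """Execute recorded calls one after another (a real runner would distribute them)."""
    return [func(*args, **kwargs) for func, args, kwargs in tasks]


calls = []


def work(x):
    calls.append(x)  # record that the function really ran
    return x * x


tasks = [my_delayed(work)(i) for i in range(3)]
ran_before = list(calls)     # still empty: nothing has been executed yet
results = run_serial(tasks)  # now the calls happen
```

Writing my_delayed(work(i)) instead would run work(i) immediately in the calling process and wrap its return value, which is the mistake to avoid with the real delayed as well.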
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool
def heavy_func(key):
    #do some heavy computation on each key
    output = key**2
    return key, output
output_data ={} #<--this dict will store the results
keys = [1,5,7,8,10] #<--compute heavy_func over all the values of keys
with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]
Now output_data is a dictionary that will contain for every key the result of the computation on this key.
That is it..
