Context
I have a function that produces a large 2D numpy array (with fixed shape) as output. I am calling this function 1000 times using joblib (Parallel with a multiprocessing backend) on 8 CPUs. At the end of the job, I add up all the arrays element-wise (using np.sum) to produce a single 2D array that I am interested in. However, when I attempt this, I run out of RAM. I assume that this is because the 1000 arrays would need to be stored in RAM until they are summed at the end.
Question
Is there a way to get each worker to add up its arrays as it goes? For example, worker 1 would add array 2 to array 1, and then discard array 2 before computing array 3, and so on. This way, there would only be a maximum of 8 arrays (for 8 CPUs) stored in RAM at any point in time, and these could be summed up at the end to get the same answer.
The fact that you know your arguments in advance and that the calculation time doesn't vary much with the actual argument(s) simplifies the task. It allows assigning a complete batch of jobs to every worker process at the start and just summing up the results at the end, just as you proposed.
In the code below, every spawned process gets an "equal" (as equal as possible) share of all arguments (its args_batch) and sums the intermediate results from calling the target function into its own result array. These per-process arrays finally get summed up by the parent process.
The "delayed" function in the example is not the target function which calculates an array, but a processing function (worker) to which the target function (calc_array) gets passed as part of the job, along with the batch of arguments.
import numpy as np
from itertools import repeat
from time import sleep
from joblib import Parallel, delayed


def calc_array(v):
    """Create an array with the specified shape and
    fill it up with value v, then kill some time.
    Dummy target function.
    """
    new_array = np.full(shape=SHAPE, fill_value=v)
    # delay result:
    cnt = 10_000_000
    for _ in range(cnt):
        cnt -= 1
    return new_array


def worker(func, args_batch):
    """Call func with every packet of arguments received and update
    the result array on the run.
    Worker function which runs the job in each spawned process.
    """
    results = np.zeros(SHAPE)
    for args_ in args_batch:
        new_array = func(*args_)
        np.sum([results, new_array], axis=0, out=results)
    return results


def main(func, arguments, n_jobs, verbose):
    with Parallel(n_jobs=n_jobs, verbose=verbose) as parallel:
        # bundle up jobs:
        funcs = repeat(func, n_jobs)  # functools.partial seems not pickle-able
        args_batches = np.array_split(arguments, n_jobs, axis=0)
        jobs = zip(funcs, args_batches)

        result = sum(parallel(delayed(worker)(*job) for job in jobs))
        assert np.all(result == sum(range(CALLS_TOTAL)))

    sleep(1)  # just to keep stdout ordered
    print(result)


if __name__ == '__main__':

    SHAPE = (4, 4)  # shape of the array calculated by calc_array
    N_JOBS = 8
    CALLS_TOTAL = 100
    VERBOSE = 10

    ARGUMENTS = np.asarray([*zip(range(CALLS_TOTAL))])
    # array([[0], [1], [2], ...])
    # zip to bundle arguments in a container so we have less code to
    # adapt when feeding a function with multiple parameters

    main(func=calc_array, arguments=ARGUMENTS, n_jobs=N_JOBS, verbose=VERBOSE)
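As a side note, if even the 8 per-worker partial sums are more than you want to hold, newer joblib releases (1.3 and later, which is an assumption about your environment) can hand results back lazily so the parent folds them into a single running total. A minimal sketch, reusing calc_array, SHAPE and CALLS_TOTAL from above:

import numpy as np
from joblib import Parallel, delayed

# Sketch only: requires joblib >= 1.3 for return_as="generator".
# Results are yielded as tasks finish (in submission order), so only a
# handful of arrays are alive in the parent at any point in time.
total = np.zeros(SHAPE)
outputs = Parallel(n_jobs=8, return_as="generator")(
    delayed(calc_array)(v) for v in range(CALLS_TOTAL)
)
for arr in outputs:
    total += arr  # fold each result in; the array can then be garbage-collected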
Related
I have some code that sorts some values originally stored in a sparse matrix and zips them together with other data. I applied some optimizations to make it fast, and the code below is 20x faster than it originally was:
This code takes 8 s on a single CPU core:
# cosine_sim is a sparse csr matrix
# names is a numpy array of length 400k
cosine_sim_labeled = []
for i in range(0, cosine_sim.shape[0]):
    row = cosine_sim.getrow(i).toarray()[0]
    non_zero_sim_indexes = np.nonzero(row)
    non_zero_sim_values = row[non_zero_sim_indexes]
    non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
    non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
    zipped = zip(non_zero_names_values, non_zero_sim_values)
    cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
But if I use the same code with multiple cores (to make it even faster), it takes 300 seconds:
# split is an array of arrays of numbers like [[1,2,3], [4,5,6]]; it is meant to generate
# batches of array indexes to be processed by each parallel process
split = np.array_split(range(0, cosine_sim.shape[0]), cosine_sim.shape[0] / batch)

def sort_rows(split):
    cosine_sim_labeled = []
    for i in split:
        row = cosine_sim.getrow(i).toarray()[0]
        non_zero_sim_indexes = np.nonzero(row)
        non_zero_sim_values = row[non_zero_sim_indexes]
        non_zero_sim_values = [round(freq, 4) for freq in non_zero_sim_values]
        non_zero_names_values = np.take(names, non_zero_sim_indexes)[0]
        zipped = zip(non_zero_names_values, non_zero_sim_values)
        cosine_sim_labeled.append(sorted(zipped, key=lambda cv: -cv[1])[1:][:top_similar_count])
    return cosine_sim_labeled

# this ensures parallel CPU execution
rows = Parallel(n_jobs=CPU_use, verbose=40)(delayed(sort_rows)(x) for x in split)
cosine_sim_labeled = np.vstack(rows).tolist()
You do realize that your new parallel function sort_rows does not even use the split argument? All it does is distribute all the data to all processes, which takes time; then each process does the exact same calculation, only to return the whole data back to the main process, which again takes time.
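One hedged way to act on that observation is to hand each worker only its row-slice of the csr matrix, so just that slice (rather than the whole matrix) is serialized per task. The names cosine_sim, names, top_similar_count, batch and CPU_use are taken from the question; the slicing approach itself is a sketch, not the original poster's code:

import numpy as np
from joblib import Parallel, delayed

def sort_rows_batch(sub_matrix):
    # sub_matrix is a row-slice of cosine_sim; names and top_similar_count
    # still need to be visible in the worker.
    labeled = []
    for i in range(sub_matrix.shape[0]):
        row = sub_matrix.getrow(i).toarray()[0]
        idx = np.nonzero(row)
        values = [round(v, 4) for v in row[idx]]
        row_names = np.take(names, idx)[0]
        labeled.append(sorted(zip(row_names, values), key=lambda cv: -cv[1])[1:][:top_similar_count])
    return labeled

n_batches = max(1, cosine_sim.shape[0] // batch)
bounds = np.array_split(np.arange(cosine_sim.shape[0]), n_batches)
rows = Parallel(n_jobs=CPU_use, verbose=40)(
    delayed(sort_rows_batch)(cosine_sim[b[0]:b[-1] + 1]) for b in bounds
)
cosine_sim_labeled = [item for chunk in rows for item in chunk]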
I am trying to run a parallel process in Python, in which I have to extract certain polygons from a large array based on some conditions. The large array has 10k+ polygons that are indexed.
To an extract_polygon function I pass (array, index). Based on the index, the function either returns the polygon corresponding to that index or None, depending on the conditions defined. The array is never changed and is only used for reading the polygon based on the index provided.
Since the array is very large, I am running into an out-of-memory error during parallel processing. How can I avoid that? (In other words: how can I effectively use a shared array in multiprocessing?)
Below is my sample code:
import time
import numpy as np
import multiprocessing as mp
from scipy import ndimage

def extract_polygon(array, index):
    try:
        islays = ndimage.find_objects(array == index)
        poly = array[islays[0][0], islays[0][1]]
        area = np.count_nonzero(poly)
        minArea = 100
        maxArea = 10000
        if (area > minArea) and (area < maxArea):
            return poly
        else:
            return None
    except:
        return None

start = time.time()
pool = mp.Pool(10)
results = pool.starmap(extract_polygon, [(array, index) for index in indices])
pool.close()
pool.join()
# indices here is a list of all the indexes we have.
Can I use any other library like ray in this case?
You can absolutely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray

ray.init()

# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)

@ray.remote
def extract_polygon(array, index):
    # Change this to actually extract the polygon.
    return index

# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]

# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
See some related answers below:
Shared-memory objects in multiprocessing
python3 multiprocess shared numpy array(read-only)
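For completeness, here is a minimal sketch of the fork-based alternative those links describe: on platforms using the fork start method, child processes see the parent's memory copy-on-write, so a large read-only array created before the pool is not pickled per task. The array and function names here are illustrative, not your actual code:

import numpy as np
import multiprocessing as mp

big_array = np.ones(10**6)  # created once in the parent, read-only afterwards

def lookup(index):
    # Reads big_array through the inherited (copy-on-write) memory;
    # replace this with your actual polygon-extraction logic.
    return big_array[index]

if __name__ == '__main__':
    # Assumes the 'fork' start method (default on Linux); with 'spawn' the
    # module is re-imported and big_array is re-created in each worker.
    with mp.Pool(10) as pool:
        results = pool.map(lookup, range(10))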
I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial

def fit_model(data, q):
    # data is a 1-D array holding precipitation values
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]  # output slope of quantile q
    return pointEstimate

# precipAll is an array of shape (1405*621, 123, 12) (longitudes*latitudes, years, months)
# find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:, 0, 0]))[0]  # 481631 indices
month = 4

# holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan

def saveResult(result, pos):
    asyncResults[pos] = result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20)  # my server has 24 CPUs
    for i in nonNaN:
        # use partial so I can also pass the index i so the result is
        # stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i, :, month], 0.9), callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it after it took longer than if I had not used multiprocessing at all. The function, fit_model, is on the order of 0.02 seconds, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results, as I am plotting this data onto a map after this processing is done. Any thoughts on where I need improvement are greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray

@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i, :, month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate

if __name__ == '__main__':
    ray.init()

    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)

    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))

    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to the copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
There is more to say about the differences between Ray and Python multiprocessing. Note that I'm helping develop Ray.
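For reference, here is what the batching idea mentioned at the top of this answer could look like with plain multiprocessing. fit_chunk is a hypothetical helper; fit_model, precipAll, nonNaN and asyncResults are taken from the question, and a fork start method is assumed so workers inherit the data without copying it per task:

import numpy as np
import multiprocessing

def fit_chunk(rows, q=0.9):
    # rows is a 2-D slice: one precipitation time series per grid point
    return [fit_model(row, q) for row in rows]

if __name__ == '__main__':
    month = 4
    # ~200 tasks instead of ~480k, so the per-task overhead is amortized
    chunks = np.array_split(precipAll[nonNaN, :, month], 200)
    with multiprocessing.Pool(processes=20) as pool:
        chunk_results = pool.map(fit_chunk, chunks)
    # pool.map preserves order, so the results line up with nonNaN
    asyncResults[nonNaN] = np.concatenate(chunk_results)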
I do some computationally expensive tasks in Python and found the thread module for parallelization. I have a function which does the computation and returns an ndarray as its result. Now I want to know how I can parallelize my function and get back the calculated arrays from each thread.
The following example is strongly simplified, with lightweight functions and calculations.
import threading
import numpy as np

def calculate_result(input):
    a = np.linspace(1.0, 1000.0, num=10000)  # just an example
    result = input * a
    return result

input = [1, 2, 3, 4]

for i in range(0, len(input)):
    t = threading.Thread(target=calculate_result, args=(input[i],))
    t.start()
    # Here I want to receive the return value from the thread
I am looking for a way to get the return value from the thread / function for each thread, because in my task each thread calculates different values.
I found another question (how to get the return value from a thread in python?) where someone is looking at a similar problem (no ndarrays), which is handled with ThreadPool and async...
-------------------------------------------------------------------------------
Thanks for your answers!
Thanks to your help, I am now looking for a way to solve my problem with the multiprocessing module. To give you a better understanding of what I do, see the following explanation.
Explanation:
My 'input_data' is an ndarray with 282240 elements of type uint32.
In the 'calculation_function()' I use a for loop to calculate a result from every 12 bits and put it into the 'output_data'.
Because this is very slow, I split my input_data into e.g. 4 or 8 parts and calculate each part in the calculation_function().
Now I am looking for a way to parallelize the 4 or 8 function calls.
The order of the data is essential, because the data is an image and each pixel has to be at the correct position. So function call no. 1 calculates the first pixel and the last function call the last pixel of the image.
The calculations work fine and the image can be completely rebuilt from my algo, but I need the parallelization to speed things up for time-critical aspects.
Summary:
One input ndarray is divided into 4 or 8 parts. Each part contains 70560 or 35280 uint32 values. From every 12 bits I calculate one pixel, using 4 or 8 function calls. Each function returns one ndarray with 188160 or 94080 pixels. All return values are put together in a row and reshaped into an image.
What already works:
Calculations are already working and I can reconstruct my image.
Problem:
Function calls are done serially, one after another, but each image reconstruction is very slow.
Main Goal:
Speed up the image reconstruction by parallelizing the function calls.
Code:
def decompress(payload, WIDTH, HEIGHT):
    # INPUTS / OUTPUTS
    n_threads = 4
    img_input = np.fromstring(payload, dtype='uint32')
    img_output = np.zeros((WIDTH * HEIGHT), dtype=np.uint32)
    n_elements_part = np.int(len(img_input) / n_threads)
    input_part = np.zeros((n_threads, n_elements_part)).astype(np.uint32)
    output_part = np.zeros((n_threads, np.int(n_elements_part / 3 * 8))).astype(np.uint32)

    # DEFINE PARTS (here 4 different ones)
    start = np.zeros(n_threads).astype(np.int)
    end = np.zeros(n_threads).astype(np.int)
    for i in range(0, n_threads):
        start[i] = i * n_elements_part
        end[i] = (i + 1) * n_elements_part - 1

    # COPY IMAGE DATA
    for idx in range(0, n_threads):
        input_part[idx, :] = img_input[start[idx]:end[idx] + 1]

    for idx in range(0, n_threads):  # the following line is the function call that should be parallelized
        output_part[idx, :] = decompress_part2(input_part[idx], output_part[idx])

    # COPY PARTS INTO THE IMAGE
    img_output[0:188160] = output_part[0, :]
    img_output[188160:376320] = output_part[1, :]
    img_output[376320:564480] = output_part[2, :]
    img_output[564480:752640] = output_part[3, :]

    # RESHAPE IMAGE
    img_output = np.reshape(img_output, (HEIGHT, WIDTH))

    return img_output
Please don't mind my beginner programming style :)
I'm just looking for a solution for how to parallelize the function calls with the multiprocessing module and get back the returned ndarrays.
Thank you so much for your help!
You can use a process pool from the multiprocessing module:
from multiprocessing.dummy import Pool

def test(a):
    return a

p = Pool(3)
a = p.starmap(test, zip([1, 2, 3]))
print(a)
p.close()
p.join()
kar's answer works; however, keep in mind that he's using the .dummy module, which might be limited by the GIL. Here's more info on it:
multiprocessing.dummy in Python is not utilising 100% cpu
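For the decompress example above, a sketch with real processes could look like the following. It assumes decompress_part2 is defined at module level (so it can be pickled) and reuses input_part and output_part from the question:

import numpy as np
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # One task per part; starmap returns the results in submission order.
        parts = pool.starmap(decompress_part2,
                             [(input_part[i], output_part[i]) for i in range(4)])
    # Order is preserved, so the image reassembles correctly.
    img_output = np.concatenate(parts)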
I am new to the IPython parallel package but really want to get it going. What I have is a 4D numpy array which I want to step through by slice, row, and column, processing along the 4th dimension (time). The processing is a minimization routine that takes a bit of time, which is why I would like to parallelize it.
from IPython.parallel import Client
from numpy import *
from matplotlib.pylab import *

c = Client()
v = c.load_balanced_view()
v.block = False

def process(src, freq, d):
    # Get slice, row, col
    sl, r, c = src
    # Get data
    mm = d[:, sl, c, r]
    # Call fitting routine
    <fitting routine that requires freq, mm and outputs multiple parameters>
    return <output parameters??>

## Create the mask of what we are going to process
mask = zeros(d[0].shape)
mask[sl][nonzero(d[0, sl] > 10 * median(d[0]))] = 1

# find all non-zero points in the mask
points = array(nonzero(mask == 1)).transpose()

# Call async
asyncresult = v.map_async(process, points, freq=freq, d=d)
My function "process" requires two parameters: 1) freq is a numpy array (100,1) and 2) d which is (100, 50, 110, 110) or so. I want to retrieve several parameters from the fitting.
All the examples I have seen that use map_async have simple lambda functions etc and the outputs seem to be trivial.
What I want is to apply "process" to every point in d where the mask is not zero, and to have maps of the output parameters in the same space. [Added: I am getting "process() takes exactly 3 arguments (1 given)".]
(Step 2 of this might be required as I am passing a huge numpy array "d" to each process. But once I figure out the data passing I should hopefully be able to figure out a more efficient way of doing this.)
Thanks for any help.
I got around the data passing problem by doing
def mapper(x):
    return apply(x[0], x[1:])
And calling map_async with a list of tuples where the first element is my function and the rest of the elements are the parameters to my function.
asyncResult = pool.map_async(mapper, [(func, arg1, arg2) for arg1, arg2 in myArgs])
I tried a lambda first but apparently that couldn't be pickled so that was a no go.
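Note that apply is a Python 2 builtin; under Python 3 the same wrapper trick can be written as, for example:

def mapper(job):
    # Call the first tuple element with the remaining elements as arguments.
    func, *args = job
    return func(*args)

asyncResult = pool.map_async(mapper, [(func, arg1, arg2) for arg1, arg2 in myArgs])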