I am new to the IPython parallel package but really want to get it going. What I have is a 4D numpy array which I want to step through by slice, row, and column, processing the 4th dimension (time). The processing is a minimization routine that takes a while, which is why I would like to parallelize it.
from IPython.parallel import Client
from numpy import *
from matplotlib.pylab import *

c = Client()
v = c.load_balanced_view()
v.block = False

def process(src, freq, d):
    # Get slice, row, col
    sl, r, c = src
    # Get data
    mm = d[:, sl, c, r]
    # Call fitting routine
    <fitting routine that requires freq, mm and outputs multiple parameters>
    return <output parameters??>

# Create the mask of what we are going to process
mask = zeros(d[0].shape)
mask[sl][nonzero(d[0, sl] > 10 * median(d[0]))] = 1
# Find all non-zero points in the mask
points = array(nonzero(mask == 1)).transpose()
# Call async
asyncresult = v.map_async(process, points, freq=freq, d=d)
My function "process" requires two parameters: 1) freq is a numpy array (100,1) and 2) d which is (100, 50, 110, 110) or so. I want to retrieve several parameters from the fitting.
All the examples I have seen that use map_async have simple lambda functions etc and the outputs seem to be trivial.
What I want is to apply "process" to every point in d where the mask is not zero and to have maps of the output parameters in the same space. [Added: I am getting "process() takes exactly 3 arguments (1 given) ].
(Step 2 of this might be required as I am passing a huge numpy array "d" to each process. But once I figure out the data passing I should hopefully be able to figure out a more efficient way of doing this.)
Thanks for any help.
I got around the data passing problem by doing
def mapper(x):
    return apply(x[0], x[1:])
And calling map_async with a list of tuples where the first element is my function and the rest of the elements are the parameters to my function.
asyncResult = pool.map_async(mapper, [(func, arg1, arg2) for arg1, arg2 in myArgs])
I tried a lambda first, but apparently that couldn't be pickled, so that was a no-go.
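Note that the apply built-in only exists in Python 2. On Python 3, the same trick works with a plain unpacking function; here is a minimal sketch that reuses the pool, func, and myArgs names from above:

def mapper(packed):
    # packed is (func, arg1, arg2, ...); unpack and call.
    # Unlike a lambda, a module-level function like this can be pickled.
    return packed[0](*packed[1:])

asyncResult = pool.map_async(mapper, [(func, arg1, arg2) for arg1, arg2 in myArgs])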
I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial

def fit_model(data, q):
    # data is a 1-D array holding precipitation values
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]  # output slope of quantile q
    return pointEstimate

# precipAll is an array of shape (1405*621, 123, 12) (longitudes*latitudes, years, months)
# find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:, 0, 0]))[0]  # 481631 indices
month = 4

# holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan

def saveResult(result, pos):
    asyncResults[pos] = result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20)  # my server has 24 CPUs
    for i in nonNaN:
        # use partial so I can also pass the index i so the result is
        # stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i, :, month], 0.9), callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it after it took longer than if I had not used multiprocessing at all. The function fit_model takes on the order of 0.02 seconds per call, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results, as I am plotting this data onto a map after the processing is done. Any thoughts on where I need improvement are greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool (a rough sketch of that is at the end of this answer). However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray

@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i, :, month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate

if __name__ == '__main__':
    ray.init()

    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)

    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))

    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
There is more discussion elsewhere about the differences between Ray and Python multiprocessing. Note that I'm helping develop Ray.
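For completeness, here is a rough sketch of the batching idea with plain multiprocessing mentioned at the top of this answer. It assumes the question's fit_model, precipAll, nonNaN, and asyncResults are defined at module level and that workers are forked (so they see precipAll); fit_batch is a hypothetical helper, and the batch count of 500 is arbitrary. Each task returns (index, slope) pairs so the parent keeps the results in the right positions:

import numpy as np
import multiprocessing

def fit_batch(indices):
    # Fit one model per index; return (index, slope) pairs so the parent
    # process can store each result in its correct position.
    return [(i, fit_model(precipAll[i, :, month], 0.9)) for i in indices]

if __name__ == '__main__':
    batches = np.array_split(nonNaN, 500)   # ~500 tasks instead of ~480,000
    with multiprocessing.Pool(processes=20) as pool:
        for batch in pool.imap_unordered(fit_batch, batches):
            for i, slope in batch:
                asyncResults[i] = slope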
Context
I have a function that produces a large 2D numpy array (with fixed shape) as output. I am calling this function 1000 times using joblib (Parallel with a multiprocessing backend) on 8 CPUs. At the end of the job, I add up all the arrays element-wise (using np.sum) to produce a single 2D array that I am interested in. However, when I attempt this, I run out of RAM. I assume that this is because the 1000 arrays would need to be stored in RAM until they are summed at the end.
Question
Is there a way to get each worker to add up its arrays as it goes? For example, worker 1 would add array 2 to array 1, and then discard array 2 before computing array 3, and so on. This way, there would only be a maximum of 8 arrays (for 8 CPUs) stored in RAM at any point in time, and these could be summed up at the end to get the same answer.
The fact that you know your arguments in advance, and that the calculation time does not vary much with the actual argument(s), simplifies the task. It allows assigning a complete job to every worker process at the start and just summing up the results at the end, just as you proposed.
In the code below, every spawned process gets an "equal" (as much as possible) part of all arguments (its args_batch) and sums the intermediate results from calling the target function into its own result array. These arrays are finally summed up by the parent process.
The "delayed" function in the example is not the target function that calculates an array, but a processing function (worker) to which the target function (calc_array) is passed as part of the job, along with its batch of arguments.
import numpy as np
from itertools import repeat
from time import sleep
from joblib import Parallel, delayed

def calc_array(v):
    """Create an array with specified shape and
    fill it up with value v, then kill some time.
    Dummy target function.
    """
    new_array = np.full(shape=SHAPE, fill_value=v)
    # delay result:
    cnt = 10_000_000
    for _ in range(cnt):
        cnt -= 1
    return new_array

def worker(func, args_batch):
    """Call func with every packet of arguments received and update
    result array on the run.
    Worker function which runs the job in each spawned process.
    """
    results = np.zeros(SHAPE)
    for args_ in args_batch:
        new_array = func(*args_)
        np.sum([results, new_array], axis=0, out=results)
    return results

def main(func, arguments, n_jobs, verbose):
    with Parallel(n_jobs=n_jobs, verbose=verbose) as parallel:
        # bundle up jobs:
        funcs = repeat(func, n_jobs)  # functools.partial seems not pickle-able
        args_batches = np.array_split(arguments, n_jobs, axis=0)
        jobs = zip(funcs, args_batches)
        result = sum(parallel(delayed(worker)(*job) for job in jobs))
    assert np.all(result == sum(range(CALLS_TOTAL)))
    sleep(1)  # just to keep stdout ordered
    print(result)

if __name__ == '__main__':
    SHAPE = (4, 4)  # shape of array calculated by calc_array
    N_JOBS = 8
    CALLS_TOTAL = 100
    VERBOSE = 10
    ARGUMENTS = np.asarray([*zip(range(CALLS_TOTAL))])
    # array([[0], [1], [2], ...])
    # zip to bundle arguments in a container so we have less code to
    # adapt when feeding a function with multiple parameters
    main(func=calc_array, arguments=ARGUMENTS, n_jobs=N_JOBS, verbose=VERBOSE)
I do some computationally expensive tasks in Python and found the threading module for parallelization. I have a function which does the computation and returns an ndarray as its result. Now I want to know how I can parallelize my function and get back the calculated arrays from each thread.
The following example is strongly simplified, with lightweight functions and calculations.
import numpy as np
import threading

def calculate_result(input):
    a = np.linspace(1.0, 1000.0, num=10000)  # just an example
    result = input * a
    return result

input = [1, 2, 3, 4]

for i in range(len(input)):
    t = threading.Thread(target=calculate_result, args=(input[i],))
    t.start()
    # Here I want to receive the return value from the thread
I am looking for a way to get the return value from the thread / function for each thread, because in my task each thread calculates different values.
I found another question (how to get the return value from a thread in python?) where someone has a similar problem (without ndarrays), which is handled with ThreadPool and async results.
-------------------------------------------------------------------------------
Thanks for your answers !
Thanks to your help, I am now looking for a way to solve my problem with the multiprocessing module. To give you a better understanding of what I do, see the following explanation.
Explanation:
My 'input_data' is an ndarray with 282240 elements of type uint32.
In the 'calculation_function()' I use a for loop to calculate a result from every 12 bits and put it into 'output_data'.
Because this is very slow, I split my input_data into e.g. 4 or 8 parts and calculate each part in the calculation_function().
Now I am looking for a way to parallelize the 4 or 8 function calls.
The order of the data is essential, because the data is an image and each pixel has to be at the correct position. So function call no. 1 calculates the first pixel and the last function call the last pixel of the image.
The calculations work fine and the image can be completely rebuilt from my algorithm, but I need the parallelization to speed it up for time-critical aspects.
Summary:
One input ndarray is divided into 4 or 8 parts. Each part contains 70560 or 35280 uint32 values. From every 12 bits I calculate one pixel, using 4 or 8 function calls. Each function returns one ndarray with 188160 or 94080 pixels. All the return values are put together in a row and reshaped into an image.
What already works:
The calculations already work and I can reconstruct my image.
Problem:
The function calls are made serially, one after another, so each image reconstruction is very slow.
Main Goal:
Speed up the image reconstruction by parallelizing the function calls.
Code:
def decompress(payload, WIDTH, HEIGHT):
    # INPUTS / OUTPUTS
    n_threads = 4
    img_input = np.fromstring(payload, dtype='uint32')
    img_output = np.zeros((WIDTH * HEIGHT), dtype=np.uint32)
    n_elements_part = np.int(len(img_input) / n_threads)
    input_part = np.zeros((n_threads, n_elements_part)).astype(np.uint32)
    output_part = np.zeros((n_threads, np.int(n_elements_part / 3 * 8))).astype(np.uint32)

    # DEFINE PARTS (here 4 different ones)
    start = np.zeros(n_threads).astype(np.int)
    end = np.zeros(n_threads).astype(np.int)
    for i in range(0, n_threads):
        start[i] = i * n_elements_part
        end[i] = (i + 1) * n_elements_part - 1

    # COPY IMAGE DATA
    for idx in range(0, n_threads):
        input_part[idx, :] = img_input[start[idx]:end[idx] + 1]

    for idx in range(0, n_threads):  # following line is the function call that should be parallelized
        output_part[idx, :] = decompress_part2(input_part[idx], output_part[idx])

    # COPY PARTS INTO THE IMAGE
    img_output[0:188160] = output_part[0, :]
    img_output[188160:376320] = output_part[1, :]
    img_output[376320:564480] = output_part[2, :]
    img_output[564480:752640] = output_part[3, :]

    # RESHAPE IMAGE
    img_output = np.reshape(img_output, (HEIGHT, WIDTH))

    return img_output
Please don't mind my beginner programming style :)
I am just looking for a solution for how to parallelize the function calls with the multiprocessing module and get back the returned ndarrays.
Thank you so much for your help !
You can use a pool from the multiprocessing module:
from multiprocessing.dummy import Pool

def test(a):
    return a

p = Pool(3)
a = p.starmap(test, zip([1, 2, 3]))
print(a)
p.close()
p.join()
kar's answer works; however, keep in mind that it uses the .dummy module, which is backed by threads and may therefore be limited by the GIL. Here's more info on it:
multiprocessing.dummy in Python is not utilising 100% cpu
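For CPU-bound work like the decompression in the question, a real process pool avoids the GIL. Here is a rough sketch under some assumptions: decompress_part2 (the question's function) is defined at module level, is picklable, and returns its filled output part; decompress_parallel and n_workers are hypothetical names introduced here.

import numpy as np
from multiprocessing import Pool

def decompress_parallel(payload, WIDTH, HEIGHT, n_workers=4):
    # np.frombuffer replaces the deprecated np.fromstring for binary data.
    img_input = np.frombuffer(payload, dtype=np.uint32)
    # Split the input into parts; every 3 uint32 values (96 bits) hold 8
    # twelve-bit pixels, so each output part is len(part) * 8 // 3 long.
    input_parts = np.array_split(img_input, n_workers)
    output_parts = [np.zeros(len(p) * 8 // 3, dtype=np.uint32) for p in input_parts]
    with Pool(processes=n_workers) as pool:
        # starmap preserves input order, so the returned parts can simply
        # be concatenated to rebuild the image.
        results = pool.starmap(decompress_part2, zip(input_parts, output_parts))
    return np.concatenate(results).reshape(HEIGHT, WIDTH)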
I have a Python function that requires the user to enter an array of data; the function then works on the data and produces two arrays which are returned to the main program. In this question I am including a greatly simplified example from which I hope I can solicit some advice or help.
I have created a function titled "Test_Function" that requires the programmer to supply an array of data titled "Array", which in this case has a length of 5000. The function works on the data and produces two arrays titled "Result1" and "Result2", which are returned to the main program as the variables "Res1" and "Res2". I would like to thread "Test_Function" so that one thread works on half of the input array and the other thread works on the other half, and then combine the results back together in the main program for both output arrays, "Result1"/"Res1" and "Result2"/"Res2". I described a scenario with two threads, but I would like to make it generic enough that it could run a user-defined number of threads. How do I do this with the thread functionality?
import numpy as np

def Test_Function(Array):
    Result1 = Array * np.pi * (1 - Array)
    Result2 = Array + 478.5 + (1 / Array)
    return (np.array(Result1, dtype=float), np.array(Result2, dtype=float))

#---------------------------------------------------------------------------
if __name__ == "__main__":
    Dependent_Array = np.linspace(1.0, 5000.0, num=5000)
    Res1, Res2 = Test_Function(Dependent_Array)
# eof
You could use a ThreadPool to which you assign async tasks:
import numpy as np
from multiprocessing.pool import ThreadPool

def sub1(Array):
    return Array * np.pi * (1 - Array)

def sub2(Array):
    return Array + 478.5 + (1 / Array)

def Test_Function(Array):
    pool = ThreadPool(processes=2)
    res1 = pool.apply_async(sub1, (Array,))
    res2 = pool.apply_async(sub2, (Array,))
    return (np.array(res1.get(), dtype=float), np.array(res2.get(), dtype=float))

if __name__ == "__main__":
    Dependent_Array = np.linspace(1.0, 5000.0, num=5000)
    Res1, Res2 = Test_Function(Dependent_Array)
This module is not very well documented, but, basically, it creates a pool of workers to which you can assign subroutines to execute. The method get() of the result object created by apply_async will return the result only when the corresponding thread has finished its operations.
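To address the user-defined number of threads mentioned in the question, one option is to split the input with np.array_split, map the chunks over a ThreadPool, and concatenate the ordered results. This is only a sketch: Test_Function_Chunked is a hypothetical name, Test_Function is the original function from the question, and the chunked results match the unchunked ones only because Test_Function operates element-wise.

import numpy as np
from multiprocessing.pool import ThreadPool

def Test_Function_Chunked(Array, n_threads=4):
    chunks = np.array_split(Array, n_threads)
    with ThreadPool(processes=n_threads) as pool:
        # map() returns results in the order of the input chunks, so the
        # pieces can be concatenated directly.
        parts = pool.map(Test_Function, chunks)
    Res1 = np.concatenate([p[0] for p in parts])
    Res2 = np.concatenate([p[1] for p in parts])
    return Res1, Res2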
I have a function defined which produces an MxN array.
The array is very large, hence I want to use the function to produce the smaller arrays (M1xN, M2xN, M3xN, ..., MixN, where M1+M2+M3+...+Mi = M) simultaneously using multiprocessing/threading and eventually join these arrays to form the MxN array. As Mr. Boardrider rightly suggested providing a viable example, the following example broadly conveys what I intend to do:
import numpy as n

def mult(y, x):
    r = n.empty([len(y), len(x)])
    for i in range(len(r)):
        r[i] = y[i] * x
    return r

x = n.random.rand(10000)
y = n.arange(0, 100000, 1)
test = mult(y=y, x=x)
As the lengths of x and y increase, the system takes more and more time. With respect to this example, I want to run this code such that if I have 4 cores, I can give a quarter of the job to each: i.e., give the job of computing elements r[0] to r[24999] to the 1st core, r[25000] to r[49999] to the 2nd core, r[50000] to r[74999] to the 3rd core, and r[75000] to r[99999] to the 4th core, and eventually club the results together, appending them to get one single array r[0] to r[99999].
I hope this example makes things clear. If my problem is still not clear, please tell.
The first thing to say is: if it's about multiple cores on the same processor, numpy is already capable of parallelizing the operation better than we could ever do by hand (see the discussion at multiplication of large arrays in python).
In this case the key would be simply to ensure that the multiplication is all done in a wholesale array operation rather than a Python for-loop:
test2 = x[n.newaxis, :] * y[:, n.newaxis]
n.abs( test - test2 ).max() # verify equivalence to mult(): output should be 0.0, or very small reflecting floating-point precision limitations
[If you actually wanted to spread this across multiple separate CPUs, that's a different matter, but the question seems to suggest a single (multi-core) CPU.]
OK, bearing the above in mind: let's suppose you want to parallelize an operation more complicated than just mult(). Let's assume you've tried hard to optimize your operation into wholesale array operations that numpy can parallelize itself, but your operation just isn't susceptible to this. In that case, you can use a shared-memory multiprocessing.Array created with lock=False, and multiprocessing.Pool to assign processes to address non-overlapping chunks of it, divided up over the y dimension (and also simultaneously over x if you want). An example listing is provided below. Note that this approach does not explicitly do exactly what you specify (club the results together and append them into a single array). Rather, it does something more efficient: multiple processes simultaneously assemble their portions of the answer in non-overlapping portions of shared memory. Once done, no collation/appending is necessary: we just read out the result.
import os, numpy, multiprocessing, itertools

# The best way to get multiprocessing.Pool to send shared multiprocessing.Array
# objects between processes is to attach them to something global - see
# http://stackoverflow.com/questions/1675766/
SHARED_VARS = {}

def operate(slices):
    # grok the inputs
    yslice, xslice = slices
    y, x, r = get_shared_arrays('y', 'x', 'r')
    # create views of the appropriate chunks/slices of the arrays:
    y = y[yslice]
    x = x[xslice]
    r = r[yslice, xslice]
    # do the actual business
    for i in range(len(r)):
        r[i] = y[i] * x  # If this is truly all operate() does, it can be parallelized far more efficiently by numpy itself.
        # But let's assume this is a placeholder for something more complicated.
    return 'Process %d operated on y[%s] and x[%s] (%d x %d chunk)' % (os.getpid(), slicestr(yslice), slicestr(xslice), y.size, x.size)

def check(y, x, r):
    r2 = x[numpy.newaxis, :] * y[:, numpy.newaxis]  # obviously this check will only be valid if operate() literally does only multiplication (in which case this whole business is unnecessary)
    print('max. abs. diff. = %g' % numpy.abs(r - r2).max())
    return y, x, r

def slicestr(s):
    return ':'.join('' if x is None else str(x) for x in [s.start, s.stop, s.step])

def m2n(buf, shape, typecode, ismatrix=False):
    """
    Return a numpy.array VIEW of a multiprocessing.Array given a
    handle to the array, the shape, the data typecode, and a boolean
    flag indicating whether the result should be cast as a matrix.
    """
    a = numpy.frombuffer(buf, dtype=typecode).reshape(shape)
    if ismatrix: a = numpy.asmatrix(a)
    return a

def n2m(a):
    """
    Return a multiprocessing.Array COPY of a numpy.array, together
    with shape, typecode and matrix flag.
    """
    if not isinstance(a, numpy.ndarray): a = numpy.array(a)
    return multiprocessing.Array(a.dtype.char, a.flat, lock=False), tuple(a.shape), a.dtype.char, isinstance(a, numpy.matrix)

def new_shared_array(shape, typecode='d', ismatrix=False):
    """
    Allocate a new shared array and return all the details required
    to reinterpret it as a numpy array or matrix (same order of
    output arguments as n2m)
    """
    typecode = numpy.dtype(typecode).char
    return multiprocessing.Array(typecode, int(numpy.prod(shape)), lock=False), tuple(shape), typecode, ismatrix

def get_shared_arrays(*names):
    return [m2n(*SHARED_VARS[name]) for name in names]

def init(*pargs, **kwargs):
    SHARED_VARS.update(pargs, **kwargs)

if __name__ == '__main__':
    ylen = 1000
    xlen = 2000

    init(y=n2m(range(ylen)))
    init(x=n2m(numpy.random.rand(xlen)))
    init(r=new_shared_array([ylen, xlen], float))

    print('Master process ID is %s' % os.getpid())

    #print( operate([slice(None), slice(None)]) ); check(*get_shared_arrays('y', 'x', 'r'))  # local test

    pool = multiprocessing.Pool(initializer=init, initargs=SHARED_VARS.items())

    yslices = [slice(0, 333), slice(333, 666), slice(666, None)]
    xslices = [slice(0, 1000), slice(1000, None)]
    #xslices = [slice(None)]  # uncomment this if you only want to divide things up in the y dimension

    reports = pool.map(operate, itertools.product(yslices, xslices))
    print('\n'.join(reports))
    y, x, r = check(*get_shared_arrays('y', 'x', 'r'))