Use multi-processing/threading to break numpy array operation into chunks - python

I have a function that renders an MxN array.
The array is very large, so I want to use the function to produce smaller arrays (M1xN, M2xN, M3xN, ..., MixN, where M1+M2+M3+...+Mi = M) simultaneously using multiprocessing/threading, and eventually join these arrays to form the full MxN array. As Mr. Boardrider rightly suggested providing a viable example, the following example broadly conveys what I intend to do:
import numpy as n
def mult(y,x):
    r = n.empty([len(y),len(x)])
    for i in range(len(r)):
        r[i] = y[i]*x
    return r
x = n.random.rand(10000)
y = n.arange(0,100000,1)
test = mult(y=y,x=x)
As the lengths of x and y increase, the computation takes more and more time. With respect to this example, I want to run this code so that, if I have 4 cores, I can give a quarter of the job to each: compute elements r[0] to r[24999] on the 1st core, r[25000] to r[49999] on the 2nd core, r[50000] to r[74999] on the 3rd core and r[75000] to r[99999] on the 4th core. Finally, combine the results by appending them to get one single array r[0] to r[99999].
I hope this example makes things clear. If my problem is still not clear, please tell me.

The first thing to say is: if it's about multiple cores on the same processor, numpy is already capable of parallelizing the operation better than we could ever do by hand (see the discussion at multiplication of large arrays in python )
In this case the key would be simply to ensure that the multiplication is all done in a wholesale array operation rather than a Python for-loop:
test2 = x[n.newaxis, :] * y[:, n.newaxis]
n.abs( test - test2 ).max() # verify equivalence to mult(): output should be 0.0, or very small reflecting floating-point precision limitations
[If you actually wanted to spread this across multiple separate CPUs, that's a different matter, but the question seems to suggest a single (multi-core) CPU.]
OK, bearing the above in mind: let's suppose you want to parallelize an operation more complicated than just mult(). Let's assume you've tried hard to optimize your operation into wholesale array operations that numpy can parallelize itself, but your operation just isn't susceptible to this. In that case, you can use a shared-memory multiprocessing.Array created with lock=False, and multiprocessing.Pool to assign processes to address non-overlapping chunks of it, divided up over the y dimension (and also simultaneously over x if you want). An example listing is provided below. Note that this approach does not explicitly do exactly what you specify (club the results together and append them into a single array). Rather, it does something more efficient: multiple processes simultaneously assemble their portions of the answer in non-overlapping portions of shared memory. Once done, no collation/appending is necessary: we just read out the result.
import os, numpy, multiprocessing, itertools
SHARED_VARS = {} # the best way to get multiprocessing.Pool to send shared multiprocessing.Array objects between processes is to attach them to something global - see http://stackoverflow.com/questions/1675766/
def operate( slices ):
    # grok the inputs
    yslice, xslice = slices
    y, x, r = get_shared_arrays('y', 'x', 'r')
    # create views of the appropriate chunks/slices of the arrays:
    y = y[yslice]
    x = x[xslice]
    r = r[yslice, xslice]
    # do the actual business
    for i in range(len(r)):
        r[i] = y[i] * x # If this is truly all operate() does, it can be parallelized far more efficiently by numpy itself.
        # But let's assume this is a placeholder for something more complicated.
    return 'Process %d operated on y[%s] and x[%s] (%d x %d chunk)' % (os.getpid(), slicestr(yslice), slicestr(xslice), y.size, x.size)
def check(y, x, r):
    r2 = x[numpy.newaxis, :] * y[:, numpy.newaxis] # obviously this check will only be valid if operate() literally does only multiplication (in which case this whole business is unnecessary)
    print( 'max. abs. diff. = %g' % numpy.abs(r - r2).max() )
    return y, x, r
def slicestr(s):
    return ':'.join( '' if x is None else str(x) for x in [s.start, s.stop, s.step] )
def m2n(buf, shape, typecode, ismatrix=False):
    """
    Return a numpy.array VIEW of a multiprocessing.Array given a
    handle to the array, the shape, the data typecode, and a boolean
    flag indicating whether the result should be cast as a matrix.
    """
    a = numpy.frombuffer(buf, dtype=typecode).reshape(shape)
    if ismatrix: a = numpy.asmatrix(a)
    return a
def n2m(a):
    """
    Return a multiprocessing.Array COPY of a numpy.array, together
    with shape, typecode and matrix flag.
    """
    if not isinstance(a, numpy.ndarray): a = numpy.array(a)
    return multiprocessing.Array(a.dtype.char, a.flat, lock=False), tuple(a.shape), a.dtype.char, isinstance(a, numpy.matrix)
def new_shared_array(shape, typecode='d', ismatrix=False):
    """
    Allocate a new shared array and return all the details required
    to reinterpret it as a numpy array or matrix (same order of
    output arguments as n2m)
    """
    typecode = numpy.dtype(typecode).char
    return multiprocessing.Array(typecode, int(numpy.prod(shape)), lock=False), tuple(shape), typecode, ismatrix
def get_shared_arrays(*names):
    return [m2n(*SHARED_VARS[name]) for name in names]
def init(*pargs, **kwargs):
    SHARED_VARS.update(pargs, **kwargs)
if __name__ == '__main__':
    ylen = 1000
    xlen = 2000
    init( y=n2m(range(ylen)) )
    init( x=n2m(numpy.random.rand(xlen)) )
    init( r=new_shared_array([ylen, xlen], float) )
    print('Master process ID is %s' % os.getpid())
    #print( operate([slice(None), slice(None)]) ); check(*get_shared_arrays('y', 'x', 'r')) # local test
    pool = multiprocessing.Pool(initializer=init, initargs=tuple(SHARED_VARS.items())) # tuple(): a bare dict view cannot be pickled if the platform spawns rather than forks
    yslices = [slice(0,333), slice(333,666), slice(666,None)]
    xslices = [slice(0,1000), slice(1000,None)]
    #xslices = [slice(None)] # uncomment this if you only want to divide things up in the y dimension
    reports = pool.map(operate, itertools.product(yslices, xslices))
    print('\n'.join(reports))
    y, x, r = check(*get_shared_arrays('y', 'x', 'r'))
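For comparison, the literal "split, compute, then join" approach described in the question might look like the minimal sketch below. It is not part of the listing above: the names mult_chunk and mult_in_chunks are made up for illustration, and a thread pool is used only to keep the example short and self-contained. Whether threads give any real speedup depends on how much of the per-chunk work runs outside the Python interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def mult_chunk(y_chunk, x):
    # Placeholder for a more expensive per-chunk computation.
    return y_chunk[:, np.newaxis] * x[np.newaxis, :]


def mult_in_chunks(y, x, n_chunks=4):
    # Split y into n_chunks nearly equal pieces along its only axis.
    y_chunks = np.array_split(y, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() returns results in the same order as the inputs.
        parts = list(pool.map(mult_chunk, y_chunks, [x] * len(y_chunks)))
    # Join the (Mi x N) pieces back into one (M x N) array.
    return np.vstack(parts)


x = np.random.rand(50)
y = np.arange(0, 1000, 1)
r = mult_in_chunks(y, x)
```

Unlike the shared-memory listing, this version builds each piece separately and pays for one extra copy in np.vstack at the end.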

How to generate complex Hypothesis data frames with internal row and column dependencies?

Is there an elegant way of using hypothesis to directly generate complex pandas data frames with internal row and column dependencies? Let's say I want columns such as:
[longitude][latitude][some-text-meta][some-numeric-meta][numeric-data][some-junk][numeric-data][…
Geographic coordinates can be individually picked at random, but sets must usually come from a general area (e.g. standard reprojections don't work if you have two points on opposite sides of the globe). It's easy to handle that by choosing an area with one strategy and columns of coordinates from inside that area with another. All good so far…
@st.composite
def plaus_spamspam_arrs(
    draw,
    st_lonlat=plaus_lonlat_arr,
    st_values=plaus_val_arr,
    st_areas=plaus_area_arr,
    st_meta=plaus_meta_arr,
    bounds=ARR_LEN,
):
    """Returns plausible spamspamspam arrays"""
    size = draw(st.integers(*bounds))
    coords = draw(st_lonlat(size=size))
    values = draw(st_values(size=size))
    areas = draw(st_areas(size=size))
    meta = draw(st_meta(size=size))
    return PlausibleData(coords, values, areas, meta)
The snippet above makes clean numpy arrays of coordinated single-value data. But the numeric data in the columns example (n-columns interspersed with junk) can also have row-wise dependencies such as needing to be normalised to some factor involving a row-wise sum and/or something else chosen dynamically at runtime.
I can generate all these bits separately, but I can't see how to stitch them into a single data frame without using a clumsy concat-based technique that, I presume, would disrupt draw-based shrinking. Moreover, I need a solution that adapts beyond what's above, so a hack is unlikely to get me very far…
Maybe there's something with builds? I just can't quite see how to do it. Thanks for sharing if you know! A short example as inspiration would likely be enough.
Update
I can generate columns roughly as follows:
@st.composite
def plaus_df_inputs(
    draw, *, nrows=None, ncols=None, nrow_bounds=ARR_LEN, ncol_bounds=COL_LEN
):
    """Returns …"""
    box_lon, box_lat = draw(plaus_box_geo())
    ncols_jnk = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    ncols_val = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    keys_val = draw(plaus_smp_key_elm(size=ncols_val))
    nrows = draw(st.integers(*nrow_bounds)) if nrows is None else nrows
    cols = (
        plaus_df_cols_lonlat(lons=plaus_lon(box_lon), lats=plaus_lat(box_lat))
        + plaus_df_cols_meta()
        + plaus_df_cols_value(keys=keys_val)
        + draw(plaus_df_cols_junk(size=ncols_jnk))
    )
    random.shuffle(cols)
    return draw(st_pd.data_frames(cols, index=plaus_df_idx(size=nrows)))
where the sub-strategies are things like
@st.composite
def plaus_df_cols_junk(
    draw, *, size=1, names=plaus_meta(), dtypes=plaus_dtype(), unique=False
):
    """Returns strategy for list of columns of plausible junk data."""
    result = set()
    for _ in range(size):
        result.add(draw(names.filter(lambda name: name not in result)))
    return [
        st_pd.column(name=result.pop(), dtype=draw(dtypes), unique=unique)
        for _ in range(size)
    ]
What I need is something more elegant that incorporates the row-based dependencies.
from hypothesis import strategies as st
@st.composite
def interval_sets(draw):
    # To create our interval sets, we'll draw from a strategy that shrinks well,
    # and then transform it into the format we want. More specifically, we'll use
    # a single lists() strategy so that the shrinker can delete chunks atomically,
    # and then rearrange the floats that we draw as part of this.
    base_elems = st.tuples(
        # Different floats bounds to ensure we get at least one valid start and end.
        st.text(),
        st.floats(0, 1, exclude_max=True),
        st.floats(0, 1, exclude_min=True),
    )
    base = draw(st.lists(base_elems, min_size=1, unique_by=lambda t: t[0]))
    nums = sorted(sum((t[1:] for t in base), start=())) # arrange our endpoints
    return [
        {"name": name, "start": start, "end": end, "size": end - start}
        for (name, _, _), start, end in zip(base, nums[::2], nums[1::2])
    ]

produce vector output from a dask array

I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.
function to apply to each chunk:
def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    # Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
    return polys
line of code trying to get dask to do this:
test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()
where labeled_arr is the input chunked dask array.
Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.
I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.
This might work:
import dask
import dask.array as da
# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))
@dask.delayed
def my_func(block):
    # do something fancy
    return list(block)
results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])
As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure if this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects feeding them into your function one at a time (but scaled across workers/processes).
Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.
poss_list = []
def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    poss_list.append(polys)
da.blockwise(get_polys, '', labeled_arr, 'ij',
             meta=pd.DataFrame({'c':[]}), concatenate=True).compute()
If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).
Update - improved answer using dask.delayed, building on the accepted answer by @SultanOrazbayev
import dask
# onedem = original_xarray_dataarray
poss_list = []
@dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
    # Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs
    def getpx(chunkid, chunksz):
        amin = chunkid[0] * chunksz[0][0]
        amax = amin + chunksz[0][0]
        bmin = chunkid[1] * chunksz[1][0]
        bmax = bmin + chunksz[1][0]
        return (amin, amax, bmin, bmax)
    # order of all inputs (and outputs) should be y, x when axis order is used
    chunksz = (onedem.chunks['y'], onedem.chunks['x'])
    ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)
    # use rasterio Windows and rioxarray to construct transform
    # https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
    chwindow = rasterio.windows.Window(xmini, ymini, xmaxi-xmini, ymaxi-ymini) #.from_slices[ymini, ymaxi],[xmini, xmaxi])
    trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)
    return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(labeled_blocks.astype('int32'), transform=trans))[:-1]
for __, obj in enumerate(labeled_arr.to_delayed()):
    for bl in obj:
        piece = dask.delayed(get_bergs)(bl, *bl.key)
        poss_list.append(piece)
poss_list = dask.compute(*poss_list)
# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item)!=0]

Sparse matrix dot product keeping only N-max values per result row

I've got a very large CSR sparse matrix M. I want to compute the dot product of this matrix with its transpose (M.dot(M.T)) and keep only the N largest values per row in the result matrix R. The problem is that M.dot(M.T) raises a MemoryError. So I wrote a modified implementation of the dot function that looks like this:
def dot_with_top(m1, m2, top=None):
    if top is not None and top > 0:
        res_rows = []
        for row_id in xrange(m1.shape[0]):
            row = m1[row_id]
            if row.nnz > 0:
                res_row = m1[row_id].dot(m2)
                if res_row.nnz > top:
                    args_ids = np.argsort(res_row.data)[-top:]
                    data = res_row.data[args_ids]
                    cols = res_row.indices[args_ids]
                    res_rows.append(csr_matrix((data, (np.zeros(top), cols)), shape=res_row.shape))
                else:
                    res_rows.append(res_row)
            else:
                res_rows.append(csr_matrix((1, m1.shape[0])))
        return sparse.vstack(res_rows, 'csr')
    return m1.dot(m2)
It works fine, but it's a bit slow. Is it possible to make this calculation faster, or do you know of an existing method that does it faster?
You can move the body of your loop over the rows into a function and call this function through a multiprocessing.Pool() object.
This will parallelize the execution of your loop and should give a nice speedup.
Example:
from multiprocessing import Pool
def f(row_id):
    # define here your function inside the loop
    return vstack(res_rows, 'csr')
if __name__ == '__main__':
    p = Pool(4) # if you have 4 cores in your processor
    p.map(f, xrange(m1.shape[0]))
source : https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Note that some Python library functions already use multiple cores internally (common in numpy), so you should check your processor activity while your script is running before implementing this solution.
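Independently of parallelism, the per-row selection itself may be worth a look. The question's code uses a full np.argsort to find the top values; np.argpartition only partially orders the data and is usually cheaper when N is small relative to the row length. The sketch below is an assumption-laden illustration, not part of the original answer: top_n_entries is a hypothetical helper that works on the data and indices arrays of a single CSR row, and it assumes n >= 1.

```python
import numpy as np


def top_n_entries(data, indices, n):
    """Return the n largest values of one sparse row with their column indices.

    Assumes n >= 1. The returned entries are not sorted by value.
    """
    if data.size <= n:
        return data, indices
    # After partitioning, the last n positions hold the n largest values.
    keep = np.argpartition(data, -n)[-n:]
    return data[keep], indices[keep]


row_data = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
row_cols = np.array([3, 8, 11, 20, 42])
vals, cols = top_n_entries(row_data, row_cols, 2)
```

In the question's loop, the equivalent change would be to replace the np.argsort call with np.argpartition in the same way.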

Size-Incremental Numpy Array in Python

I just came across the need for an incremental Numpy array in Python, and since I haven't found anything, I implemented it. I'm just wondering whether my way is the best way or whether you can come up with other ideas.
So, the problem is that I have a 2D array (the program handles nD arrays) whose size is not known in advance, and a variable amount of data needs to be concatenated to the array in one direction (let's say that I have to call np.vstack many times). Every time I concatenate data, I need to take the array, sort it along axis 0 and do other stuff, so I cannot construct a long list of arrays and then np.vstack the list at once.
Since memory allocation is expensive, I turned to incremental arrays, where I grow the array by more than the size I need (I use 50% increments), so that I minimize the number of allocations.
I coded this up as follows:
class ExpandingArray:
    __DEFAULT_ALLOC_INIT_DIM = 10 # default initial dimension for all the axes if nothing is given by the user
    __DEFAULT_MAX_INCREMENT = 10 # default value in order to limit the increment of memory allocation
    __MAX_INCREMENT = [] # Max increment
    __ALLOC_DIMS = [] # Dimensions of the allocated np.array
    __DIMS = [] # Dimensions of the view with data on the allocated np.array (__DIMS <= __ALLOC_DIMS)
    __ARRAY = [] # Allocated array
    def __init__(self,initData,allocInitDim=None,dtype=np.float64,maxIncrement=None):
        self.__DIMS = np.array(initData.shape)
        self.__MAX_INCREMENT = maxIncrement
        if self.__MAX_INCREMENT == None:
            self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT
        # Compute the allocation dimensions based on user's input
        if allocInitDim == None:
            allocInitDim = self.__DIMS.copy()
        while np.any( allocInitDim < self.__DIMS ) or np.any(allocInitDim == 0):
            for i in range(len(self.__DIMS)):
                if allocInitDim[i] == 0:
                    allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                if allocInitDim[i] < self.__DIMS[i]:
                    allocInitDim[i] += min(allocInitDim[i]/2, self.__MAX_INCREMENT)
        # Allocate memory
        self.__ALLOC_DIMS = allocInitDim
        self.__ARRAY = np.zeros(self.__ALLOC_DIMS,dtype=dtype)
        # Set initData
        sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
        self.__ARRAY[sliceIdxs] = initData
    def shape(self):
        return tuple(self.__DIMS)
    def getAllocArray(self):
        return self.__ARRAY
    def getDataArray(self):
        """
        Get the view of the array with data
        """
        sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
        return self.__ARRAY[sliceIdxs]
    def concatenate(self,X,axis=0):
        if axis > len(self.__DIMS):
            print "Error: axis number exceeds the number of dimensions"
            return
        # Check dimensions for remaining axes
        for i in range(len(self.__DIMS)):
            if i != axis:
                if X.shape[i] != self.shape()[i]:
                    print "Error: Dimensions of the input array are not consistent in the axis %d" % i
                    return
        # Check whether allocated memory is enough
        needAlloc = False
        while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
            needAlloc = True
            # Increase the __ALLOC_DIMS
            self.__ALLOC_DIMS[axis] += min(self.__ALLOC_DIMS[axis]/2,self.__MAX_INCREMENT)
        # Reallocate memory and copy old data
        if needAlloc:
            # Allocate
            newArray = np.zeros(self.__ALLOC_DIMS)
            # Copy
            sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
            newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
            self.__ARRAY = newArray
        # Concatenate new data
        sliceIdxs = []
        for i in range(len(self.__DIMS)):
            if i != axis:
                sliceIdxs.append(slice(self.__DIMS[i]))
            else:
                sliceIdxs.append(slice(self.__DIMS[i],self.__DIMS[i]+X.shape[i]))
        self.__ARRAY[sliceIdxs] = X
        self.__DIMS[axis] += X.shape[axis]
The code shows considerably better performance than vstack/hstack over several randomly sized concatenations.
What I'm wondering is: is this the best way? Is there anything in numpy that does this already?
Further, it would be nice to be able to overload the slice assignment operator of np.array, so that as soon as the user assigns anything outside the actual dimensions, an ExpandingArray.concatenate() is performed. How can such overloading be done?
Testing code: here is the code I used to compare vstack with my method. I add random chunks of data with a maximum length of 100.
import time
N = 10000
def performEA(N):
    EA = ExpandingArray(np.zeros((0,2)),maxIncrement=1000)
    for i in range(N):
        nNew = np.random.random_integers(low=1,high=100,size=1)
        X = np.random.rand(nNew,2)
        EA.concatenate(X,axis=0)
        # Perform operations on EA.getDataArray()
    return EA
def performVStack(N):
    A = np.zeros((0,2))
    for i in range(N):
        nNew = np.random.random_integers(low=1,high=100,size=1)
        X = np.random.rand(nNew,2)
        A = np.vstack((A,X))
        # Perform operations on A
    return A
start_EA = time.clock()
EA = performEA(N)
stop_EA = time.clock()
start_VS = time.clock()
VS = performVStack(N)
stop_VS = time.clock()
print "Elapsed Time EA: %.2f" % (stop_EA-start_EA)
print "Elapsed Time VS: %.2f" % (stop_VS-start_VS)
I think the most common design pattern for these things is to just use a list for the small arrays. Sure you could do things like dynamic resizing (if you want to do crazy things, you can try to use the resize array method too). I think a typical method is to always double the size, when you really don't know how large things will be. Of course if you know how large the array will grow to, just allocating the full thing up front is simplest.
def performVStack_fromlist(N):
    l = []
    for i in range(N):
        nNew = np.random.random_integers(low=1,high=100,size=1)
        X = np.random.rand(nNew,2)
        l.append(X)
    return np.vstack(l)
I am sure there are some use cases where an expanding array could be useful (for example, when the appended arrays are all very small), but this loop seems better handled with the above pattern. The optimization is mostly about how often you need to copy everything around, and with a list like this (other than the list itself) that happens exactly once. So it is normally much faster.
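For the case where the data really must be available as one array after every append, the "always double the size" idea mentioned above could be sketched roughly as follows. GrowableRows is a hypothetical helper written for this illustration (it is not a numpy class), it only grows along axis 0, and it assumes the initial capacity is at least 1.

```python
import numpy as np


class GrowableRows:
    """Hypothetical helper: append rows into a buffer that doubles when full."""

    def __init__(self, ncols, capacity=16, dtype=np.float64):
        # capacity must be >= 1, otherwise doubling never makes progress
        self._buf = np.empty((capacity, ncols), dtype=dtype)
        self._n = 0  # number of rows actually holding data

    def append(self, rows):
        rows = np.atleast_2d(rows)
        needed = self._n + rows.shape[0]
        if needed > self._buf.shape[0]:
            capacity = self._buf.shape[0]
            while capacity < needed:
                capacity *= 2
            # One reallocation and one copy of the existing rows
            new_buf = np.empty((capacity, self._buf.shape[1]), dtype=self._buf.dtype)
            new_buf[: self._n] = self._buf[: self._n]
            self._buf = new_buf
        self._buf[self._n : needed] = rows
        self._n = needed

    @property
    def data(self):
        # View of the filled part only; no copy is made
        return self._buf[: self._n]


g = GrowableRows(ncols=2, capacity=4)
for k in range(1, 6):
    g.append(np.full((k, 2), float(k)))
```

Because the capacity doubles, the number of reallocations grows only logarithmically with the final number of rows.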
When I faced a similar problem, I used ndarray.resize() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.resize.html#numpy.ndarray.resize). Most of the time, it will avoid reallocation+copying altogether. I can't guarantee it would prove to be faster (it probably would), but it's so much simpler.
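A minimal sketch of that approach, assuming a C-contiguous array that owns its data and only grows along its first axis (growing along another axis would not keep existing elements in their rows):

```python
import numpy as np

# The array must own its data; a view (e.g. the result of reshape) cannot be resized.
a = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Grow from 3 to 5 rows in place. refcheck=False skips the reference-count
# check, so it is only safe if no other array or view still refers to `a`.
a.resize((5, 2), refcheck=False)

# The first three rows are preserved; fill in the newly added rows.
a[3:] = [[10.0, 11.0], [12.0, 13.0]]
```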
As for your second question, I think overriding slice assignment for extending purposes is not a good idea. That operator is meant for assigning to existing items/slices. If you want to change that, it's not immediately clear how you'd want it to behave in some cases, e.g.:
a = MyExtendableArray(np.arange(100))
a[200] = 6 # resize to 200? pad [100:200] with what?
a[90:110] = 7 # assign to existing items AND automagically-allocated items?
a[::-1][200] = 6 # ...
My suggestion is that slice-assignment and data appending should remain separate.

ipython map_async input and output data

I am new to the IPython parallel package but really want to get it going. What I have is a 4D numpy array which I want to run through slices,rows,columns and process the 4th dimension (time). The processing is a minimization routine that takes a bit of time which is why I would like to parallelize it.
from IPython.parallel import Client
from numpy import *
from matplotlib.pylab import *
c = Client()
v = c.load_balanced_view()
v.block=False
def process( src, freq, d ):
    # Get slice, row, col
    sl,r,c = src
    # Get data
    mm = d[:,sl,c,r]
    # Call fitting routine
    <fitting routine that requires freq, mm and outputs multiple parameters>
    return <output parameters??>
## Create the mask of what we are going to process
mask = zeros(d[0].shape)
mask[sl][ nonzero( d[0,sl] > 10*median(d[0]) ) ] = 1
# find all non-zero points in the mask
points = array(nonzero( mask == 1)).transpose()
# Call async
asyncresult = v.map_async( process, points, freq=freq, d=d )
My function "process" requires two parameters: 1) freq is a numpy array (100,1) and 2) d which is (100, 50, 110, 110) or so. I want to retrieve several parameters from the fitting.
All the examples I have seen that use map_async have simple lambda functions etc and the outputs seem to be trivial.
What I want is to apply "process" to every point in d where the mask is not zero and to have maps of the output parameters in the same space. [Added: I am getting "process() takes exactly 3 arguments (1 given)".]
(Step 2 of this might be required as I am passing a huge numpy array "d" to each process. But once I figure out the data passing I should hopefully be able to figure out a more efficient way of doing this.)
Thanks for any help.
I got around the data passing problem by doing
def mapper(x):
    return apply(x[0], x[1:])
And calling map_async with a list of tuples where the first element is my function and the rest of the elements are the parameters to my function.
asyncResult = pool.map_async(mapper, [(func, arg1, arg2) for arg1, arg2 in myArgs])
I tried a lambda first but apparently that couldn't be pickled so that was a no go.
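The built-in apply used above no longer exists in Python 3. One alternative, sketched here under assumptions, is to bind the fixed arguments with functools.partial and map over the remaining one. The process function, its freq and scale parameters, and the thread pool are stand-ins chosen to keep the example self-contained; they are not the IPython view or the fitting routine from the question.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def process(point, freq, scale):
    # Stand-in for a fitting routine that needs extra fixed inputs.
    return (point, point * scale + freq)


# Bind the arguments that are the same for every call.
worker = partial(process, freq=10, scale=2)

with ThreadPool(2) as pool:
    async_result = pool.map_async(worker, [1, 2, 3])
    results = async_result.get()  # blocks until all calls finish
```

A partial object wrapping a module-level function can be pickled, which is the step where the lambda failed.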
