produce vector output from a dask array - python

I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.
function to apply to each chunk:
def get_polys(labeled_blocks):
polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
labeled_blocks.astype('int32'), transform=trans))[:-1]
# Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
return polys
line of code trying to get dask to do this:
test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()
where labeled_arr is the input chunked dask array.
Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.
I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.

This might work:
import dask
import dask.array as da
# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))
#dask.delayed
def my_func(block):
# do something fancy
return list(block)
results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])
As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure if this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects feeding them into your function one at a time (but scaled across workers/processes).

Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.
poss_list = []
def get_polys(labeled_blocks):
polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
labeled_blocks.astype('int32'), transform=trans))[:-1]
poss_list.append(polys)
da.blockwise(get_bergs, '', labeled_arr, 'ij',
meta=pd.DataFrame({'c':[]}), concatenate=True).compute()
If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).
Update - improved answer using dask.delayed, building on the accepted answer by #SultanOrazbayev
import dask
# onedem = original_xarray_dataarray
poss_list = []
#dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
# Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs
def getpx(chunkid, chunksz):
amin = chunkid[0] * chunksz[0][0]
amax = amin + chunksz[0][0]
bmin = chunkid[1] * chunksz[1][0]
bmax = bmin + chunksz[1][0]
return (amin, amax, bmin, bmax)
# order of all inputs (and outputs) should be y, x when axis order is used
chunksz = (onedem.chunks['y'], onedem.chunks['x'])
ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)
# use rasterio Windows and rioxarray to construct transform
# https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
chwindow = rasterio.windows.Window(xmini, ymini, xmaxi-xmini, ymaxi-ymini) #.from_slices[ymini, ymaxi],[xmini, xmaxi])
trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)
return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(labeled_blocks.astype('int32'), transform=trans))[:-1]
for __, obj in enumerate(labeled_arr.to_delayed()):
for bl in obj:
piece = dask.delayed(get_bergs)(bl, *bl.key)
poss_list.append(piece)
poss_list = dask.compute(*poss_list)
# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item)!=0]

Related

How to generate complex Hypothesis data frames with internal row and column dependencies?

Is there an elegant way of using hypothesis to directly generate complex pandas data frames with internal row and column dependencies? Let's say I want columns such as:
[longitude][latitude][some-text-meta][some-numeric-meta][numeric-data][some-junk][numeric-data][…
Geographic coordinates can be individually picked at random, but sets must usually come from a general area (e.g. standard reprojections don't work if you have two points on opposite sides of the globe). It's easy to handle that by choosing an area with one strategy and columns of coordinates from inside that area with another. All good so far…
#st.composite
def plaus_spamspam_arrs(
draw,
st_lonlat=plaus_lonlat_arr,
st_values=plaus_val_arr,
st_areas=plaus_area_arr,
st_meta=plaus_meta_arr,
bounds=ARR_LEN,
):
"""Returns plausible spamspamspam arrays"""
size = draw(st.integers(*bounds))
coords = draw(st_lonlat(size=size))
values = draw(st_values(size=size))
areas = draw(st_areas(size=size))
meta = draw(st_meta(size=size))
return PlausibleData(coords, values, areas, meta)
The snippet above makes clean numpy arrays of coordinated single-value data. But the numeric data in the columns example (n-columns interspersed with junk) can also have row-wise dependencies such as needing to be normalised to some factor involving a row-wise sum and/or something else chosen dynamically at runtime.
I can generate all these bits separately, but I can't see how to stitch them into a single data frame without using a clumsy concat-based technique that, I presume, would disrupt draw-based shrinking. Moreover, I need a solution that adapts beyond what's above, so a hack likely get me too far…
Maybe there's something with builds? I just can't quite see out how to do it. Thanks for sharing if you know! A short example as inspiration would likely be enough.
Update
I can generate columns roughly as follows:
#st.composite
def plaus_df_inputs(
draw, *, nrows=None, ncols=None, nrow_bounds=ARR_LEN, ncol_bounds=COL_LEN
):
"""Returns …"""
box_lon, box_lat = draw(plaus_box_geo())
ncols_jnk = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
ncols_val = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
keys_val = draw(plaus_smp_key_elm(size=ncols_val))
nrows = draw(st.integers(*nrow_bounds)) if nrows is None else nrows
cols = (
plaus_df_cols_lonlat(lons=plaus_lon(box_lon), lats=plaus_lat(box_lat))
+ plaus_df_cols_meta()
+ plaus_df_cols_value(keys=keys_val)
+ draw(plaus_df_cols_junk(size=ncols_jnk))
)
random.shuffle(cols)
return draw(st_pd.data_frames(cols, index=plaus_df_idx(size=nrows)))
where the sub-stats are things like
#st.composite
def plaus_df_cols_junk(
draw, *, size=1, names=plaus_meta(), dtypes=plaus_dtype(), unique=False
):
"""Returns strategy for list of columns of plausible junk data."""
result = set()
for _ in range(size):
result.add(draw(names.filter(lambda name: name not in result)))
return [
st_pd.column(name=result.pop(), dtype=draw(dtypes), unique=unique)
for _ in range(size)
]
What I need is something more elegant that incorporates the row-based dependencies.
from hypothesis import strategies as st
#st.composite
def interval_sets(draw):
# To create our interval sets, we'll draw from a strategy that shrinks well,
# and then transform it into the format we want. More specifically, we'll use
# a single lists() strategy so that the shrinker can delete chunks atomically,
# and then rearrange the floats that we draw as part of this.
base_elems = st.tuples(
# Different floats bounds to ensure we get at least one valid start and end.
st.text(),
st.floats(0, 1, exclude_max=True),
st.floats(0, 1, exclude_min=True),
)
base = draw(st.lists(base_elems, min_size=1, unique_by=lambda t: t[0]))
nums = sorted(sum((t[1:] for t in base), start=())) # arrange our endpoints
return [
{"name": name, "start": start, "end": end, "size": end - start}
for (name, _, _), start, end in zip(base, nums[::2], nums[1::2])
]

Apply ufunc to xarray single Dataset variable as delayed operation using dask

I would like to apply a custom function to a variable within an xarray.Dataset modifying only the specified variable. At the same time I am trying to make this part of a dask computation graph so it can be delayed prior to reading out to disk with to_netcdf.
At the moment I can apply the ufunc using xr.apply_ufunc() but only to all variables within the Dataset.
I understand I could probably access the variable directly using it's name like Dataset.var and pass this to apply_ufunc() but I don't quite understand how the output of this function (a delayed future) would be recombined with the original dataset prior to output.
Ideally, I want to do something like this (where 'data.nc' has multiple variables and only var1 is squared).
import xarray as xr
from distributed import Client
dask_client = Client()
def square(x):
return x*x
data = xr.open_dataset('data.nc', chunks={'d1':10})
fut_sq = xr.apply_ufunc(square, data.var1, dask='parallelized', output_dtypes=['float'])
data.var1 = fut_sq.var1
fut_save = data.to_netcft('new.nc', compute=False)
dask_client.compute(fut_save)
So I played around with this a bit more and decided that the best way to do this was to extract the data from the netCDF4 file, convert it to a dask.array and then rewrite a new file to disk. This involves writing custom functions using the dask.delayed functionality. Using the ufunc approach was probably inappropriate for my problem.
A few drawbacks of this:
You don't seem to be able to modify the file in place. To save the modified variables from the original NetCDF4 file you have to rewrite the whole file to disk.
For me at least, the best way to parallelise the custom square function was to create my own data chunks and pass these to chunks individually to square. Then reconstitute them using dask.array.concatenate. I know dask has some bagging functionality but I struggled to get it to work the way I wanted.
The reading of the file happens in parallel but it does not appear that dask writes to NetCDF4 in parallel.
It would be great if I can be corrected on these points.
Here is my amended example
import xarray as xr
from distributed import Client
import dask
import dask.array as da
dask_client = Client()
def bag_slices(ind, n=10):
bag = list()
prev = 0
for i in range(len(ind)):
if (i+1)%n == 0:
bag.append(slice(prev, i+1, 1))
prev = i+1
if prev != i+1:
bag.append(slice(prev, i+1, 1))
return bag
#dask.delayed
def square(x):
return x*x
#dask.delayed
def assign(old_xr_dataset, new_data):
old_xr_dataset['var1'].values = new_data
return old_xr_dataset
# for me data.data.var1 is 3D and I process by splitting the data along the second dimension.
with xr.open_dataset('data.nc', chunks={'d1':10}) as data:
# create slice bags for distributed processing along preferred axis
bags = bag_slices(data.coords['dim2'].values, n=10)
# convert to dask array
data_da = da.from_array(data.var1.values)
# create data bags
bags = [data_da[:, slc, :] for slc in bags]
future_squared = []
for data_bag in bags:
# concatenate doesn't understand delayed objects
# so must convert them back to delayed arrays
future_squared.append(da.from_delayed(square(data_bag), data_bag.shape, dtype=float))
data_new = dask.array.concatenate(future_squared, axis=1)
fut_dataset = assign(data, data_new)
fut_nc_save = fut_dataset.to_netcdf('data_squared.nc', compute=False)
fut_nc_save.compute()

Using shared array in multiprocessing

I am trying to run a parallel process in python, wherein I have to extract certain polygons from a large array based on some conditions. The large array has 10k+ polygons that are indexed.
In a extract_polygon function I pass (array, index). Based on index the function has to either return the polygon corresponding to that index or not based on the conditions defined. The array is never changed and is only used for reading the polygon based on the index provided.
Since the array is very large, I am running into out of memory error during parallel processing. how can I avoid that? (In a way, how to effectively use shared array in multiprocessing?)
Below is my sample code:
def extract_polygon(array, index):
try:
islays = ndimage.find_objects(clone==index)
poly = clone[islays[0][0],islays[0][1]]
area = np.count_nonzero(ploy)
minArea = 100
maxArea = 10000
if (area > minArea) and (area < maxArea):
return poly
else:
return None
except:
return None
start = time.time()
pool = mp.Pool(10)
results = pool.starmap(get_objects,[(array, index) for index in indices])
pool.close()
pool.join()
#indices here is a list of all the indexes we have.
Can I use any other library like ray in this case?
You can absolutely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray
ray.init()
# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)
#ray.remote
def extract_polygon(array, index):
# Change this to actual extract the polygon.
return index
# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]
# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
See some related answers below:
Shared-memory objects in multiprocessing
python3 multiprocess shared numpy array(read-only)

Use multi-processing/threading to break numpy array operation into chunks

I have a function defined which renders a MxN array.
The array is very huge hence I want to use the function to produce small arrays (M1xN, M2xN, M3xN --- MixN. M1+M2+M3+---+Mi = M) simultaneously using multi-processing/threading and eventually join these arrays to form mxn array. As Mr. Boardrider rightfully suggested to provide a viable example, following example would broadly convey what I intend to do
import numpy as n
def mult(y,x):
r = n.empty([len(y),len(x)])
for i in range(len(r)):
r[i] = y[i]*x
return r
x = n.random.rand(10000)
y = n.arange(0,100000,1)
test = mult(y=y,x=x)
As the lengths of x and y increase the system will take more and more time. With respect to this example, I want to run this code such that if I have 4 cores, I can give quarter of the job to each, i.e give job to compute elements r[0] to r[24999] to the 1st core, r[25000] to r[49999] to the 2nd core, r[50000] to r[74999] to the 3rd core and r[75000] to r[99999] to the 4th core. Eventually club the results, append them to get one single array r[0] to r[99999].
I hope this example makes things clear. If my problem is still not clear, please tell.
The first thing to say is: if it's about multiple cores on the same processor, numpy is already capable of parallelizing the operation better than we could ever do by hand (see the discussion at multiplication of large arrays in python )
In this case the key would be simply to ensure that the multiplication is all done in a wholesale array operation rather than a Python for-loop:
test2 = x[n.newaxis, :] * y[:, n.newaxis]
n.abs( test - test2 ).max() # verify equivalence to mult(): output should be 0.0, or very small reflecting floating-point precision limitations
[If you actually wanted to spread this across multiple separate CPUs, that's a different matter, but the question seems to suggest a single (multi-core) CPU.]
OK, bearing the above in mind: let's suppose you want to parallelize an operation more complicated than just mult(). Let's assume you've tried hard to optimize your operation into wholesale array operations that numpy can parallelize itself, but your operation just isn't susceptible to this. In that case, you can use a shared-memory multiprocessing.Array created with lock=False, and multiprocessing.Pool to assign processes to address non-overlapping chunks of it, divided up over the y dimension (and also simultaneously over x if you want). An example listing is provided below. Note that this approach does not explicitly do exactly what you specify (club the results together and append them into a single array). Rather, it does something more efficient: multiple processes simultaneously assemble their portions of the answer in non-overlapping portions of shared memory. Once done, no collation/appending is necessary: we just read out the result.
import os, numpy, multiprocessing, itertools
SHARED_VARS = {} # the best way to get multiprocessing.Pool to send shared multiprocessing.Array objects between processes is to attach them to something global - see http://stackoverflow.com/questions/1675766/
def operate( slices ):
# grok the inputs
yslice, xslice = slices
y, x, r = get_shared_arrays('y', 'x', 'r')
# create views of the appropriate chunks/slices of the arrays:
y = y[yslice]
x = x[xslice]
r = r[yslice, xslice]
# do the actual business
for i in range(len(r)):
r[i] = y[i] * x # If this is truly all operate() does, it can be parallelized far more efficiently by numpy itself.
# But let's assume this is a placeholder for something more complicated.
return 'Process %d operated on y[%s] and x[%s] (%d x %d chunk)' % (os.getpid(), slicestr(yslice), slicestr(xslice), y.size, x.size)
def check(y, x, r):
r2 = x[numpy.newaxis, :] * y[:, numpy.newaxis] # obviously this check will only be valid if operate() literally does only multiplication (in which case this whole business is unncessary)
print( 'max. abs. diff. = %g' % numpy.abs(r - r2).max() )
return y, x, r
def slicestr(s):
return ':'.join( '' if x is None else str(x) for x in [s.start, s.stop, s.step] )
def m2n(buf, shape, typecode, ismatrix=False):
"""
Return a numpy.array VIEW of a multiprocessing.Array given a
handle to the array, the shape, the data typecode, and a boolean
flag indicating whether the result should be cast as a matrix.
"""
a = numpy.frombuffer(buf, dtype=typecode).reshape(shape)
if ismatrix: a = numpy.asmatrix(a)
return a
def n2m(a):
"""
Return a multiprocessing.Array COPY of a numpy.array, together
with shape, typecode and matrix flag.
"""
if not isinstance(a, numpy.ndarray): a = numpy.array(a)
return multiprocessing.Array(a.dtype.char, a.flat, lock=False), tuple(a.shape), a.dtype.char, isinstance(a, numpy.matrix)
def new_shared_array(shape, typecode='d', ismatrix=False):
"""
Allocate a new shared array and return all the details required
to reinterpret it as a numpy array or matrix (same order of
output arguments as n2m)
"""
typecode = numpy.dtype(typecode).char
return multiprocessing.Array(typecode, int(numpy.prod(shape)), lock=False), tuple(shape), typecode, ismatrix
def get_shared_arrays(*names):
return [m2n(*SHARED_VARS[name]) for name in names]
def init(*pargs, **kwargs):
SHARED_VARS.update(pargs, **kwargs)
if __name__ == '__main__':
ylen = 1000
xlen = 2000
init( y=n2m(range(ylen)) )
init( x=n2m(numpy.random.rand(xlen)) )
init( r=new_shared_array([ylen, xlen], float) )
print('Master process ID is %s' % os.getpid())
#print( operate([slice(None), slice(None)]) ); check(*get_shared_arrays('y', 'x', 'r')) # local test
pool = multiprocessing.Pool(initializer=init, initargs=SHARED_VARS.items())
yslices = [slice(0,333), slice(333,666), slice(666,None)]
xslices = [slice(0,1000), slice(1000,None)]
#xslices = [slice(None)] # uncomment this if you only want to divide things up in the y dimension
reports = pool.map(operate, itertools.product(yslices, xslices))
print('\n'.join(reports))
y, x, r = check(*get_shared_arrays('y', 'x', 'r'))

constructing a wav file and writing it to disk using scipy

I wish to deconstruct a wave file into small chunks, reassemble it in a different order and then write it to disk.
I seem to have problems with writing it after reassembling the pieces so for now I just try to debug this section and worry about the rest later.
Basically I read the original wav into a 2D numpy array, break it into 100 piece stored within a list of smaller 2D numpy arrays, and then stack these arrays vertically using vstack:
import scipy.io.wavfile as sciwav
import numpy
[sr,stereo_data] = sciwav.read('filename')
nparts = 100
stereo_parts = list()
part_length = len(stereo_data) / nparts
for i in range(nparts):
start = i*part_length
end = (i+1)*part_length
stereo_parts.append(stereo_data[start:end])
new_data = numpy.array([0,0])
for i in range(nparts):
new_data = numpy.vstack([new_data, stereo_parts[i]])
sciwav.write('new_filename', sr, new_data)
So far I verified that new_data looks similar to stereo_data with two exceptions:
1. it has [0,0] padded at the beginning.
2. It is 88 samples shorter because len(stereo_data)/nparts does not divide without remainder.
When I try to listen to the resulting new_data eave file all I hear is silence, which I think does not make much sense.
Thanks for the help!
omer
It is very likely the dtype that is different. When you generate the zeros to pad at the beggining, you are not specifying a dtype, so they are probably np.int32. Your original data is probably np.uint8or np.uint16, so the whole array gets promoted to np.int32, which is not the right bit depth for your data. Simply do:
new_data = numpy.array([0,0], dtype=stereo_data)
I would actually rather do:
new_data = numpy.zeros((1, 2), dtype=stereo_data.dtype)
You could, by the way, streamline your code quite a bit, and get rid of a lot of for loops:
sr, stereo_data = sciwav.read('filename')
nparts = 100
part_length = len(stereo_data) // nparts
stereo_parts = numpy.split(stereo_data[:part_length*nparts], nparts)
new_data = numpy.vstack([numpy.zeros((1, 2), dtype=stereo_data.dtype)] +
stereo_parts)
sciwav.write('new_filename', sr, new_data)

Categories