Apply ufunc to xarray single Dataset variable as delayed operation using dask

Apply ufunc to xarray single Dataset variable as delayed operation using dask - python

I would like to apply a custom function to a variable within an xarray.Dataset modifying only the specified variable. At the same time I am trying to make this part of a dask computation graph so it can be delayed prior to reading out to disk with to_netcdf.
At the moment I can apply the ufunc using xr.apply_ufunc() but only to all variables within the Dataset.
I understand I could probably access the variable directly using it's name like Dataset.var and pass this to apply_ufunc() but I don't quite understand how the output of this function (a delayed future) would be recombined with the original dataset prior to output.
Ideally, I want to do something like this (where 'data.nc' has multiple variables and only var1 is squared).
import xarray as xr
from distributed import Client
dask_client = Client()
def square(x):
return x*x
data = xr.open_dataset('data.nc', chunks={'d1':10})
fut_sq = xr.apply_ufunc(square, data.var1, dask='parallelized', output_dtypes=['float'])
data.var1 = fut_sq.var1
fut_save = data.to_netcft('new.nc', compute=False)
dask_client.compute(fut_save)

So I played around with this a bit more and decided that the best way to do this was to extract the data from the netCDF4 file, convert it to a dask.array and then rewrite a new file to disk. This involves writing custom functions using the dask.delayed functionality. Using the ufunc approach was probably inappropriate for my problem.
A few drawbacks of this:
You don't seem to be able to modify the file in place. To save the modified variables from the original NetCDF4 file you have to rewrite the whole file to disk.
For me at least, the best way to parallelise the custom square function was to create my own data chunks and pass these to chunks individually to square. Then reconstitute them using dask.array.concatenate. I know dask has some bagging functionality but I struggled to get it to work the way I wanted.
The reading of the file happens in parallel but it does not appear that dask writes to NetCDF4 in parallel.
It would be great if I can be corrected on these points.
Here is my amended example
import xarray as xr
from distributed import Client
import dask
import dask.array as da
dask_client = Client()
def bag_slices(ind, n=10):
bag = list()
prev = 0
for i in range(len(ind)):
if (i+1)%n == 0:
bag.append(slice(prev, i+1, 1))
prev = i+1
if prev != i+1:
bag.append(slice(prev, i+1, 1))
return bag
#dask.delayed
def square(x):
return x*x
#dask.delayed
def assign(old_xr_dataset, new_data):
old_xr_dataset['var1'].values = new_data
return old_xr_dataset
# for me data.data.var1 is 3D and I process by splitting the data along the second dimension.
with xr.open_dataset('data.nc', chunks={'d1':10}) as data:
# create slice bags for distributed processing along preferred axis
bags = bag_slices(data.coords['dim2'].values, n=10)
# convert to dask array
data_da = da.from_array(data.var1.values)
# create data bags
bags = [data_da[:, slc, :] for slc in bags]
future_squared = []
for data_bag in bags:
# concatenate doesn't understand delayed objects
# so must convert them back to delayed arrays
future_squared.append(da.from_delayed(square(data_bag), data_bag.shape, dtype=float))
data_new = dask.array.concatenate(future_squared, axis=1)
fut_dataset = assign(data, data_new)
fut_nc_save = fut_dataset.to_netcdf('data_squared.nc', compute=False)
fut_nc_save.compute()

Related

produce vector output from a dask array

I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.
function to apply to each chunk:
def get_polys(labeled_blocks):
polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
labeled_blocks.astype('int32'), transform=trans))[:-1]
# Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
return polys
line of code trying to get dask to do this:
test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()
where labeled_arr is the input chunked dask array.
Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.
I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.

This might work:
import dask
import dask.array as da
# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))
#dask.delayed
def my_func(block):
# do something fancy
return list(block)
results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])
As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure if this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects feeding them into your function one at a time (but scaled across workers/processes).

Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.
poss_list = []
def get_polys(labeled_blocks):
polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
labeled_blocks.astype('int32'), transform=trans))[:-1]
poss_list.append(polys)
da.blockwise(get_bergs, '', labeled_arr, 'ij',
meta=pd.DataFrame({'c':[]}), concatenate=True).compute()
If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).
Update - improved answer using dask.delayed, building on the accepted answer by #SultanOrazbayev
import dask
# onedem = original_xarray_dataarray
poss_list = []
#dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
# Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs
def getpx(chunkid, chunksz):
amin = chunkid[0] * chunksz[0][0]
amax = amin + chunksz[0][0]
bmin = chunkid[1] * chunksz[1][0]
bmax = bmin + chunksz[1][0]
return (amin, amax, bmin, bmax)
# order of all inputs (and outputs) should be y, x when axis order is used
chunksz = (onedem.chunks['y'], onedem.chunks['x'])
ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)
# use rasterio Windows and rioxarray to construct transform
# https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
chwindow = rasterio.windows.Window(xmini, ymini, xmaxi-xmini, ymaxi-ymini) #.from_slices[ymini, ymaxi],[xmini, xmaxi])
trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)
return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(labeled_blocks.astype('int32'), transform=trans))[:-1]
for __, obj in enumerate(labeled_arr.to_delayed()):
for bl in obj:
piece = dask.delayed(get_bergs)(bl, *bl.key)
poss_list.append(piece)
poss_list = dask.compute(*poss_list)
# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item)!=0]

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. SpaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would like to apply to each document, and I would hopefully not have to reinitialize this object for each future / function call. What is the best way to handle this?
Example of what I'm talking about:
def text_fields_to_sentences(
dataframe:pd.DataFrame,
...
)->pd.DataFrame:
# THIS IS WHAT I WOULD LIKE TO CHANGE
nlp, = setup_spacy(scispacy_version)
def field_to_sentences(row):
result = []
doc = nlp(row[text_field])
for sentence_tokens in doc.sents:
sentence_text = "".join([t.string for t in sentence_tokens])
r = text_data.copy()
r[sentence_text_field] = sentence_text
result.append(r)
return result
series = dataframe.apply(
field_to_sentences,
axis=1
).explode()
return pd.DataFrame(
[s[new_col_order].values for s in series],
columns=new_col_order
)
input_data.map_partitions(text_fields_to_sentences)

You could create the object as a delayed object
corpus = dask.delayed(make_corpus)("english")
And then use this lazy value in place of the full value:
df = df.text.apply(parse, corpus=corpus)
Dask will call make_corpus once on one machine and then pass it around to the workers as it is needed. It will not recompute any task.

Using shared array in multiprocessing

I am trying to run a parallel process in python, wherein I have to extract certain polygons from a large array based on some conditions. The large array has 10k+ polygons that are indexed.
In a extract_polygon function I pass (array, index). Based on index the function has to either return the polygon corresponding to that index or not based on the conditions defined. The array is never changed and is only used for reading the polygon based on the index provided.
Since the array is very large, I am running into out of memory error during parallel processing. how can I avoid that? (In a way, how to effectively use shared array in multiprocessing?)
Below is my sample code:
def extract_polygon(array, index):
try:
islays = ndimage.find_objects(clone==index)
poly = clone[islays[0][0],islays[0][1]]
area = np.count_nonzero(ploy)
minArea = 100
maxArea = 10000
if (area > minArea) and (area < maxArea):
return poly
else:
return None
except:
return None
start = time.time()
pool = mp.Pool(10)
results = pool.starmap(get_objects,[(array, index) for index in indices])
pool.close()
pool.join()
#indices here is a list of all the indexes we have.
Can I use any other library like ray in this case?

You can absolutely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray
ray.init()
# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)
#ray.remote
def extract_polygon(array, index):
# Change this to actual extract the polygon.
return index
# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]
# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
See some related answers below:
Shared-memory objects in multiprocessing
python3 multiprocess shared numpy array(read-only)

How to correctly implement apply_async for data processing?

I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial
def fit_model(data,q):
#data is a 1-D array holding precipitation values
years = np.arange(1895,2018,1)
res = QuantReg(exog=sm.add_constant(years),endog=data).fit(q=q)
pointEstimate = res.params[1] #output slope of quantile q
return pointEstimate
#precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
#find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0] #481631 indices
month = 4
#holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan
def saveResult(result,pos):
asyncResults[pos] = result
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=20) #my server has 24 CPUs
for i in nonNaN:
#use partial so I can also pass the index i so the result is
#stored in the expected position
new_callback_function = partial(saveResult, pos=i)
pool.apply_async(fit_model, args=(precipAll[i,:,month],0.9),callback=new_callback_function)
pool.close()
pool.join()
When I ran this, I stopped it after it took longer than had I not used multiprocessing at all. The function, fit_model, is on the order of 0.02 seconds, so could the overhang associated with apply_async be causing the slowdown? I need to maintain order of the results as I am plotting this data onto a map after this processing is done. Any thoughts on where I need improvement is greatly appreciated!

If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray
#ray.remote
def fit_model(precip_all, i, month, q):
data = precip_all[i,:,month]
years = np.arange(1895, 2018, 1)
res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
pointEstimate = res.params[1]
return pointEstimate
if __name__ == '__main__':
ray.init()
# Create an array and place it in shared memory so that the workers can
# access it (in a read-only fashion) without creating copies.
precip_all = np.zeros((100, 123, 12))
precip_all_id = ray.put(precip_all)
result_ids = []
for i in range(precip_all.shape[0]):
result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))
results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
For about the differences between Ray and Python multiprocessing. Note I'm helping develop Ray.

Preprocessing CSV data using tensorflow DataSet API

I'm playing around a bit with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file, with 307 columns, of which the first is a string representing a date, and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features instead of, but based on, the date string. (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 ones after that as one feature, and base my label off of the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf
defaults = []
defaults.append([""])
for i in range(0,306):
defaults.append([1.0])
def dataset(train_fraction=0.8):
path = "training_examples_shuffled.csv"
# Define how the lines of the file should be parsed
def decode_line(line):
items = tf.decode_csv(line, record_defaults=defaults)
datetimeString = items[0]
minuteFeatures = items[1:121]
halfHourFeatures = items[121:217]
labelFeatures = items[217:]
## Do something to convert datetimeString to timeSine and timeCosine
features_dict = {
'timeSine': timeSine,
'timeCosine': timeCosine,
'minuteFeatures': minuteFeatures,
'halfHourFeatures': halfHourFeatures
}
label = [1] # placeholder. I seem to need some python logic here, but I'm
not sure how to apply that to data in tensor format.
return features_dict, label
def in_training_set(line):
"""Returns a boolean tensor, true if the line is in the training set."""
num_buckets = 1000000
bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
# Use the hash bucket id as a random number that's deterministic per example
return bucket_id < int(train_fraction * num_buckets)
def in_test_set(line):
"""Returns a boolean tensor, true if the line is in the training set."""
return ~in_training_set(line)
base_dataset = (tf.data
# Get the lines from the file.
.TextLineDataset(path))
train = (base_dataset
# Take only the training-set lines.
.filter(in_training_set)
# Decode each line into a (features_dict, label) pair.
.map(decode_line))
# Do the same for the test-set.
test = (base_dataset.filter(in_test_set).map(decode_line))
return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Apply ufunc to xarray single Dataset variable as delayed operation using dask - python

Related

produce vector output from a dask array

How should I load a memory-intensive helper object per-worker in dask distributed?

Using shared array in multiprocessing

How to correctly implement apply_async for data processing?

Preprocessing CSV data using tensorflow DataSet API

Categories

Resources