I want to compute the parameters of a statistical distribution fitted over the time dimension of an xarray.DataArray.
I'd like to create a function that does something like:
from scipy import stats
import xarray as xr
def fit(arr):
    return xr.apply_ufunc(stats.norm.fit, arr, ...)
that returns a new DataArray storing the two parameters of the distribution computed over the time dimension. So if an input has dimensions (time, lat, lon), fit would return a DataArray with dimensions (params, lat, lon). The next step would be to use these parameters to compute various percentiles (e.g. stats.norm.ppf).
After many unsuccessful trials, I'm starting to doubt that apply_ufunc supports this use case, and I suspect I should instead do the computation using
params = np.apply_along_axis(stats.norm.fit, arr.get_axis_num('time'), arr.data)
then create the DataArray manually, copying dimensions and attributes.
Thoughts? Suggestions?
Here is what I ended up doing, which feels a bit like a hack:
import dask.array
# Fit the parameters (lazy computation); `dc` is the scipy.stats distribution object (e.g. stats.norm)
data = dask.array.apply_along_axis(dc.fit, arr.get_axis_num('time'), arr)
# Create a DataArray with the desired dimensions to copy them over to the parameter array.
mean = arr.mean(dim='time', keep_attrs=True)
coords = dict(mean.coords.items())
coords['dparams'] = ([] if dc.shapes is None else dc.shapes.split(',')) + ['loc', 'scale']
out = xr.DataArray(data=data, coords=coords, dims=(u'dparams',) + mean.dims)
out.attrs = arr.attrs
Dask array includes an analogue of apply_along_axis, which may be the most obvious place to start. Note that each variable of an xarray object that has chunks set automatically encapsulates a dask array in its .data attribute. You may even be able to pass the xarray variable directly.
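For reference, here is a minimal sketch of how apply_ufunc can cover this use case, assuming an in-memory (NumPy-backed) array; with dask='parallelized' you would additionally have to declare the size of the new dparams dimension:
import numpy as np
import xarray as xr
from scipy import stats

def fit(arr, dim='time'):
    # each 1-D slice along `dim` is reduced to the (loc, scale) pair of a normal fit
    return xr.apply_ufunc(
        lambda x: np.asarray(stats.norm.fit(x)),
        arr,
        input_core_dims=[[dim]],
        output_core_dims=[['dparams']],
        vectorize=True,
    ).assign_coords(dparams=['loc', 'scale'])

da = xr.DataArray(np.random.randn(100, 3, 4), dims=('time', 'lat', 'lon'))
params = fit(da)                                        # dims: (lat, lon, dparams)
p90 = stats.norm.ppf(0.9,
                     loc=params.sel(dparams='loc'),
                     scale=params.sel(dparams='scale'))  # 90th percentile per (lat, lon)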
Related
I have a 3-dimensional xarray dataset with the dimensions x, y, and time. Assuming I know that there's a missing observation at timestep n, what would be the best way to insert a timeslice with no-data values?
Here's a working example:
import xarray as xr
import pandas as pd
x = xr.tutorial.load_dataset("air_temperature")
# assuming this is the missing point in time (currently not in the dataset)
missing = "2014-12-31T07:00:00"
# create an "empty" time slice with fillvalues
empty = xr.full_like(x.isel(time=0), -3000)
# fix the time coordinate of the timeslice
empty['time'] = pd.date_range(missing, periods=1)[0]
# before insertion
print(x.time[-5:].values)
# ['2014-12-30T18:00:00.000000000' '2014-12-31T00:00:00.000000000'
# '2014-12-31T06:00:00.000000000' '2014-12-31T12:00:00.000000000'
# '2014-12-31T18:00:00.000000000']
# concat and sort time
x2 = xr.concat([x, empty], "time").sortby("time")
# after insertion
print(x2.time[-5:].values)
# ['2014-12-31T00:00:00.000000000' '2014-12-31T06:00:00.000000000'
# '2014-12-31T07:00:00.000000000' '2014-12-31T12:00:00.000000000'
# '2014-12-31T18:00:00.000000000']
The example works fine, but I'm not sure whether that's the best (or even the correct) approach.
My concern is using this with bigger datasets, specifically with dask-array-backed datasets.
Is there a better way to fill a missing 2d array?
Would it be better to use a dask-backed "fill array" when inserting into a dask-backed dataset?
You might consider using xarray's reindex method with a constant fill_value for this purpose:
import numpy as np
import xarray as xr
x = xr.tutorial.load_dataset("air_temperature")
missing_time = np.datetime64("2014-12-31T07:00:00")
missing_time_da = xr.DataArray([missing_time], dims=["time"], coords=[[missing_time]])
full_time = xr.concat([x.time, missing_time_da], dim="time")
full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
I think both your method and the reindex method will automatically use dask-backed arrays if x is dask-backed.
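A quick way to check the dask behaviour (the variable name "air" below comes from the tutorial dataset) is to chunk x first and inspect the backing array of the result:
x_chunked = x.chunk({"time": 100})
full_lazy = x_chunked.reindex(time=full_time, fill_value=-3000.0).sortby("time")
print(type(full_lazy["air"].data))  # a dask array: nothing is computed until .compute()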
My dataset has 3 dimensions in the order (time, y, x), and I use apply_ufunc to apply a computation along the time dimension. This rearranges the order of the dimensions to (y, x, time). I need to restructure the xarray so it's in the (time, y, x) order of the original dataset. How would I go about doing this?
Here is what's happening. Before applying the function, the dimensions are in the original (time, y, x) order.
Then I apply my function:
dcube = xr.apply_ufunc(
    bc.clip_and_normalize_percentile,
    dcube,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask='allowed',
    vectorize=True,
)
As expected, time is moved to the last dimension, so the result comes out as (y, x, time).
How do I rearrange this so that it's in the order of the original array? Are there parameters that prevent apply_ufunc from moving the dims?
The docs say that
Core dimensions are automatically moved to the last axes of input
variables before applying func, which facilitates using NumPy style
generalized ufuncs
so it's unlikely that there's a way (or any parameters) to prevent that.
What I've been doing is simply calling .transpose afterwards to restore the initial order.
In your example, that would look like:
dcube = dcube.transpose("time", ...)
This fixes time as the first dimension and shifts all the other ones behind it using the ... (Ellipsis).
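A tiny illustration with a dummy array (dimension names chosen to match the question):
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((5, 4, 3)), dims=("y", "x", "time"))  # order after apply_ufunc
print(da.transpose("time", ...).dims)                            # ('time', 'y', 'x')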
Would np.swapaxes help?
import numpy as np
aa = np.arange(2*3*4).reshape(2,3,4)
bb = aa.swapaxes(2,0)
print(bb.shape)
print(aa[0,1,2])
print(bb[2,1,0])
It seems np.einsum can work too:
import numpy as np
aa = np.arange(2*3*4).reshape(2,3,4)
bb = np.einsum('ijk->kji',aa)
print(bb.shape)
print(aa[0,1,2])
print(bb[2,1,0])
I'm trying to use scipy.optimize.curve_fit on a large latitude/longitude/time xarray using dask.distributed as computing backend.
The idea is to run an individual data fitting for every (latitude, longitude) using the time series.
All of this runs fine outside xarray/dask. I tested it using the time series of a single location passed as a pandas dataframe. However, if I try to run the same process on the same (latitude, longitude) directly on the xarray, the curve_fit operation returns the initial parameters.
I am performing this operation using xr.apply_ufunc like so (here I'm providing only the code that is strictly relevant to the problem):
import numpy as np
import xarray as xr
from inspect import signature
from scipy.optimize import curve_fit

# function to perform the fit
def _fit_rti_curve(data, data_rti, fit, loc=False):
    fit_func, linearize, find_init_params = _get_fit_functions(fit)
    # remove nans
    x, y = _filter_nodata(data_rti, data)
    # remove outliers
    x, y = _filter_for_outliers(x, y, linearize=linearize)
    # find a first guess for the maximum achievable value
    yscale = np.max(y) * 1.05
    # find a first guess for the other parameters
    # here loc can be manually passed if you have a good estimate
    init_parms = find_init_params(x, y, yscale, loc=loc, linearize=linearize)
    # fit the curve and return parameters
    parms = curve_fit(fit_func, x, y, p0=init_parms, maxfev=10000)
    parms = parms[0]
    return parms

# shell around _fit_rti_curve
def find_rti_func_parms(data, rti, fit):
    # sort and fit the highest n values
    top_data = np.sort(data)
    top_data = top_data[-len(rti):]
    # convert to float64 if needed
    top_data = top_data.astype(np.float64)
    rti = rti.astype(np.float64)
    # run the fit
    parms = _fit_rti_curve(top_data, rti, fit, loc=0)  # TODO: maybe add an option to allow a free loc
    return parms

# call to apply_ufunc
# `fit` is a string that defines the distribution type
# `rti` is an array of the x values
parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[['time']],
    output_core_dims=[[fit + ' parameters']],
    output_sizes={fit + ' parameters': len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={'rti': return_time_interval, 'fit': fit},
    dask='parallelized',
    output_dtypes=['float64']
)
My guess would be that this is a problem related to threading, or at least to some shared memory space that is not properly passed between workers and the scheduler.
However, I am just not knowledgeable enough to test this within dask.
Any idea on this problem?
You should have a look at this issue: https://github.com/pydata/xarray/issues/4300
I had the same problem and solved it using apply_ufunc. It is not optimized, since it has to perform rechunking operations, but it works!
I've created a GitHub Gist for it: https://gist.github.com/clausmichele/8350e1f7f15e6828f29579914276de71
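In case it helps, here is a rough sketch of that pattern under assumptions of my own (the model exp_func and the parameter names are illustrative, not taken from the gist): rechunk so the core dimension sits in a single chunk, then let apply_ufunc vectorize a per-pixel curve_fit over the remaining dimensions.
import numpy as np
import xarray as xr
from scipy.optimize import curve_fit

def exp_func(x, a, b):
    # illustrative model only
    return a * np.exp(b * x)

def fit_all_pixels(da, x):
    def fit_pixel(y):
        popt, _ = curve_fit(exp_func, x, y, p0=[1.0, 0.1], maxfev=10000)
        return popt
    da = da.chunk({'time': -1})  # the core dimension must live in a single chunk
    return xr.apply_ufunc(
        fit_pixel,
        da,
        input_core_dims=[['time']],
        output_core_dims=[['param']],
        vectorize=True,
        dask='parallelized',
        output_dtypes=[float],
        dask_gufunc_kwargs={'output_sizes': {'param': 2}},  # on older xarray, pass output_sizes= directly
    )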
This previous answer might be helpful. It uses numpy.polyfit, but I think the general approach should be similar:
Applying numpy.polyfit to xarray Dataset
Also, I haven't tried it, but xr.polyfit() was just merged recently! Could also be something to look into: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit
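For reference, a small sketch of what that could look like on dummy data (polyfit fits along a dimension whose coordinate it can convert to numbers; here a plain integer time):
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(10, 4, 5),
                  dims=("time", "lat", "lon"),
                  coords={"time": np.arange(10)})
fit = da.polyfit(dim="time", deg=1)   # a Dataset holding "polyfit_coefficients"
coeffs = fit.polyfit_coefficients     # dims: ("degree", "lat", "lon")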
I have a series of N images that are recorded at different times. I have stacked the images into a 3-D dask array and rechunked them along the time axis. I would now like to perform a linear fit at each pixel position across the image, but I am running into the following error when using da.map_blocks as I try to scale up: TypeError: expected 1D or 2D array for y
I found one other post, applying-a-function-along-an-axis-of-a-dask-array, related to this but it didn't address an issue with specifically setting the chunk size. When using da.apply_along_axis I found an issue similar to the one reported in dask-performance-apply-along-axis wherein only one CPU seems to be utilized during the computation (even for chunked data).
MWE: Works properly
import dask.array as da
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def f(y, args, axis=None):
    return np.polyfit(args[0], y.squeeze(), args[1])[:, None, None]
deg = 1
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,1,1)
a = da.linspace(1, nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, deg], dtype='float').compute()
m_fit = results[0]
b_fit = results[1]
# Plot a few fits to visually examine them
fig, ax = plt.subplots(nrows=1, ncols=1)
for (x, y) in zip([1, 9], [1, 9]):
    ax.scatter(times, chunked[:, x, y])
    ax.plot(times, np.polyval([m_fit[x, y], b_fit[x, y]], times))
The chunked array has shape (20, 10, 10) with chunks of size (20, 1, 1). The resulting plot shows the fitted lines passing through the scattered data, which is exactly what I would expect, so all is well! However, the issue arises whenever I try to use a chunk size larger than one.
MWE: Raises TypeError
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,5,5) # Chunking the data now
a = da.linspace(1,nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, 1], dtype='float') # error
Does anyone have any ideas as to what is happening here?
It looks like maybe your function expects single-dimensional inputs. I wonder if there is a way to write a Python function that wraps your function and handles the unpacking and then repacking of one-dimensional inputs. If you can get that function to work on a single NumPy array of shape (20, 2, 2), for example, then you can probably use Dask to apply it across many similarly sized chunks, as in the sketch below.
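A minimal sketch of that wrapper idea, with names of my own choosing rather than from the post above: np.polyfit already accepts a 2-D y where each column is an independent series, so the wrapper can flatten the pixel axes of each block, fit all columns at once, and repack the coefficients.
import dask.array as da
import numpy as np

def fit_block(block, times, deg):
    # block has shape (ntime, ny, nx); flatten the pixel axes, fit, and repack
    ntime, ny, nx = block.shape
    coeffs = np.polyfit(times, block.reshape(ntime, ny * nx), deg)
    return coeffs.reshape(deg + 1, ny, nx)

nsamp = 20 * 10 * 10
stack = da.linspace(1, nsamp, nsamp).reshape(20, 10, 10).rechunk((20, 5, 5))
times = np.linspace(1, 20, 20)
results = stack.map_blocks(fit_block, times, 1, chunks=(2, 5, 5), dtype=float).compute()
print(results.shape)  # (2, 10, 10): slope and intercept for every pixel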
I'm using xarray with data for which I have measurements and errors.
I store these along a dimension moment in the dataset with coordinates value and variance.
When I compute for example the mean along a dimension I need values and variances to be treated differently as the former should be combined as
mean_values = sum(values)/len(values)
but the latter as
mean_variance = sum(variances**2)/len(variances).
Currently I'm doing this by forming two new datasets and concatenating them. This is very ugly and convoluted, and not suited to more complex calculations. I would like to be able to do this kind of operation in one step, perhaps by defining a function taking values and variances as input and then broadcasting the dataset dimension moment onto it.
Given a dataset q_lp with dimensions moment, time, position:
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
where average_of_squares is defined by
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis:
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
What better ways are there to handle this?
Is it possible to use xr.apply_ufunc and a custom my_average function to do this in one step and in place?
Should I not be putting these into one dataset together at all? q_lp is later on combined with other quantities, also with dimensions moment, pos and time, into a Dataset.
I'm grateful for discussion, ideas, tips and links to examples.
Edit:
To clarify, I don't like splitting the DataArray, handling each moment separately and concatenating them again. I would prefer a possibility to do the following (untested pseudocode for illustration):
def multi_moment_average(mean, variance):
    mean = np.average(mean)
    variance = np.sum(variance**2)/len(variance)
    return mean, variance

q_lp.reduce(multi_moment_average, broadcast='moment', dim='time')
Minimal working example:
import numpy as np
import xarray as xr
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis:
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()
q_lp = xr.DataArray(np.array([values, variance]),
                    coords=[['value', 'variance'], times, positions],
                    dims=['moment', 'time', 'position'])
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
I think you can write your function in an xarray-friendly way, and then call it on your data. i.e.
def average_of_squares(data, dim=None):
    sums = (data ** 2).sum(dim)
    return sums/data.count(dim)**2
q_lp_var = q_lp.sel(moment='variance').pipe(average_of_squares, dim='time')
Having them concat-ed in the same DataArray is fine; it might be a more natural fit for items on a Dataset, though.
Does that answer your question?
Edit: regarding the edited question, I think holding the items in a Dataset rather than a DataArray is most coherent with the data structures. It seems like the mean and variance are two different arrays you want aligned on the same indexes, so a Dataset is ideal.
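For illustration, a minimal sketch of that Dataset layout (names are just examples), applying a different reduction rule to each variable in one pass, using the mean for value and the sum-of-squares rule for variance from the question:
import numpy as np
import xarray as xr

times = np.arange(10)
positions = np.array([1, 3, 5])
ds = xr.Dataset(
    {
        "value": (("time", "position"), np.random.rand(len(times), len(positions))),
        "variance": (("time", "position"), 0.1 * np.random.rand(len(times), len(positions))),
    },
    coords={"time": times, "position": positions},
)

# each variable gets its own reduction, aligned on the shared coordinates
reduced = xr.Dataset(
    {
        "value": ds["value"].mean("time"),
        "variance": (ds["variance"] ** 2).sum("time") / ds["variance"].count("time"),
    }
)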
I found a solution that suits my needs, but am still grateful for more suggestions:
groupby can separate a Dataset or DataArray along a specified dimension; calling list on the result creates (key, value) tuples, and dict of those essentially has the form of a keyword dictionary. See http://xarray.pydata.org/en/stable/groupby.html
My current solution thus looks like this:
import xarray as xr
def function_applier(data, function, split_dimension=None, **function_kwargs):
    return xr.concat(
        function(
            **dict(list(data.groupby(split_dimension))),
            **function_kwargs),
        dim=split_dimension)
Now I can define functions taking specific coordinates as inputs which can be written to also work for e.g. numpy arrays.
(MWE using the specific example of my original question here)
import numpy as np
def average_of_gaussians(val, var, dim=None):
    return val.mean(dim), (var ** 2).sum(dim)/var.count(dim)

val = np.random.rand(12).reshape(2, 6)
var = 0.1*np.random.rand(12).reshape(2, 6)
da = xr.DataArray([val, var],
                  dims=['moment', 'position', 'time'],
                  coords=[['val', 'var'],
                          ['a', 'b'],
                          np.arange(6)])
>>> da
<xarray.DataArray (moment: 2, position: 2, time: 6)>
array([[[0.66233728, 0.71419351, 0.96758741, 0.96949021, 0.94594299,
0.05080628],
[0.44005458, 0.64616657, 0.69865189, 0.84970553, 0.19561433,
0.8529829 ]],
[[0.02209967, 0.02152369, 0.09181031, 0.00223527, 0.01448938,
0.01484197],
[0.05651841, 0.04942305, 0.08250529, 0.04258035, 0.00184209,
0.0957248 ]]])
Coordinates:
* moment (moment) <U3 'val' 'var'
* position (position) <U1 'a' 'b'
* time (time) int32 0 1 2 3 4 5
>>> function_applier(da, average_of_gaussians, split_dimension='moment', dim='time')
<xarray.DataArray (moment: 2, position: 2)>
array([[0.71839295, 0.61386263],
[0.001636 , 0.00390397]])
Coordinates:
* position (position) <U1 'a' 'b'
* moment (moment) object 'val' 'var'
Note that the argument names of average_of_gaussians match the coordinate values along moment. The different operation on each variable within a single function, and the absence of any reference to xarray inside it, are the properties I am after.