My dataset has 3 dimensions in the order (time, y, x) and I use apply_ufunc to apply a computation along the time dimension. This rearranges the order of the dimensions to (y, x, time). I need to restructure the xarray so it's in the (time, y, x) order of the original dataset. How would I go about doing this?
Here is a visual description of what's happening:
Before:
Then I apply my function:
dcube = xr.apply_ufunc(
    bc.clip_and_normalize_percentile,
    dcube,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask='allowed',
    vectorize=True,
)
As expected, time is moved to the last dimension:
How do I rearrange this so that it's in the order of the original array? Are there parameters that prevent apply_ufunc from moving the dims?
The docs say that
Core dimensions are automatically moved to the last axes of input
variables before applying func, which facilitates using NumPy style
generalized ufuncs
so it's unlikely that there's a way (or any parameters) to prevent that.
What I've been doing is to simply call .transpose afterwards to restore the initial order.
In your example, that would look like:
dcube = dcube.transpose("time", ...)
fixing time as the first dimension and shifting all the other dimensions behind it using ... (the Ellipsis).
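For illustration, here's a minimal, self-contained sketch of the reordering and the fix; the toy sizes and the identity lambda stand in for the real clip_and_normalize_percentile computation:
import numpy as np
import xarray as xr

# Toy cube in the original (time, y, x) order.
dcube = xr.DataArray(np.random.rand(5, 3, 4), dims=("time", "y", "x"))

# A placeholder function applied over the "time" core dimension;
# apply_ufunc moves that dimension to the end of the output.
out = xr.apply_ufunc(
    lambda arr: arr,
    dcube,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    vectorize=True,
)
print(out.dims)   # ('y', 'x', 'time')

# Restore the original order: time first, every other dimension after it.
out = out.transpose("time", ...)
print(out.dims)   # ('time', 'y', 'x')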
Would np.swapaxes help?
import numpy as np

aa = np.arange(2*3*4).reshape(2, 3, 4)
bb = aa.swapaxes(2, 0)   # exchange the first and last axes
print(bb.shape)          # (4, 3, 2)
print(aa[0, 1, 2])
print(bb[2, 1, 0])       # the same element, indices reversed
It seems np.einsum can work too:
import numpy as np

aa = np.arange(2*3*4).reshape(2, 3, 4)
bb = np.einsum('ijk->kji', aa)   # reverse the axis order
print(bb.shape)                  # (4, 3, 2)
print(aa[0, 1, 2])
print(bb[2, 1, 0])               # the same element, indices reversed
I have a 3-dimensional xarray dataset with the dimensions x, y, and time. Assuming I know that there's a missing observation at timestep n, what would be the best way to insert a timeslice with no-data values?
Here's a working example:
import xarray as xr
import pandas as pd
x = xr.tutorial.load_dataset("air_temperature")
# assuming this is the missing point in time (currently not in the dataset)
missing = "2014-12-31T07:00:00"
# create an "empty" time slice with fillvalues
empty = xr.full_like(x.isel(time=0), -3000)
# fix the time coordinate of the timeslice
empty['time'] = pd.date_range(missing, periods=1)[0]
# before insertion
print(x.time[-5:].values)
# ['2014-12-30T18:00:00.000000000' '2014-12-31T00:00:00.000000000'
#  '2014-12-31T06:00:00.000000000' '2014-12-31T12:00:00.000000000'
#  '2014-12-31T18:00:00.000000000']
# concat and sort time
x2 = xr.concat([x, empty], "time").sortby("time")
# after insertion
print(x2.time[-5:].values)
# ['2014-12-31T00:00:00.000000000' '2014-12-31T06:00:00.000000000'
# '2014-12-31T07:00:00.000000000' '2014-12-31T12:00:00.000000000'
# '2014-12-31T18:00:00.000000000']
The example works fine, but I'm not sure if that's the best (or even the correct) approach.
My concerns are to use this with bigger datasets, and specifically with dask-array backed datasets.
Is there a better way to fill a missing 2d array?
Would it be better to use a dask-backed "fill array" when inserting into a dask-backed dataset?
You might consider using xarray's reindex method with a constant fill_value for this purpose:
import numpy as np
import xarray as xr
x = xr.tutorial.load_dataset("air_temperature")
missing_time = np.datetime64("2014-12-31T07:00:00")
missing_time_da = xr.DataArray([missing_time], dims=["time"], coords=[[missing_time]])
full_time = xr.concat([x.time, missing_time_da], dim="time")
full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
I think both your method and the reindex method will automatically use dask-backed arrays if x is dask-backed.
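To sanity-check the dask part, here is a small sketch (same variable names as above, with the dataset chunked first); reindex and sortby should keep the result lazily dask-backed:
import numpy as np
import xarray as xr

# Chunk the tutorial dataset so it is dask-backed.
x = xr.tutorial.load_dataset("air_temperature").chunk({"time": 500})

missing_time = np.datetime64("2014-12-31T07:00:00")
missing_time_da = xr.DataArray([missing_time], dims=["time"], coords=[[missing_time]])
full_time = xr.concat([x.time, missing_time_da], dim="time")

full = x.reindex(time=full_time, fill_value=-3000.0).sortby("time")
print(type(full.air.data))  # expected: a dask array, i.e. still lazy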
I have a series of N images that are recorded at different times. I have stacked the images into a 3-D dask array and rechunked them along the time axis. I would now like to perform a linear fit at each pixel position across the image, but I am running into the following error when using da.map_blocks as I try to scale up: TypeError: expected 1D or 2D array for y
I found one other post, applying-a-function-along-an-axis-of-a-dask-array, related to this but it didn't address an issue with specifically setting the chunk size. When using da.apply_along_axis I found an issue similar to the one reported in dask-performance-apply-along-axis wherein only one CPU seems to be utilized during the computation (even for chunked data).
MWE: Works properly
import dask.array as da
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def f(y, args, axis=None):
    return np.polyfit(args[0], y.squeeze(), args[1])[:, None, None]
deg = 1
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,1,1)
a = da.linspace(1, nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, deg], dtype='float').compute()
m_fit = results[0]
b_fit = results[1]
# Plot a few fits to visually examine them
fig, ax = plt.subplots(nrows=1, ncols=1)
for (x, y) in zip([1, 9], [1, 9]):
    ax.scatter(times, chunked[:, x, y])
    ax.plot(times, np.polyval([m_fit[x, y], b_fit[x, y]], times))
The array, chunked, looks like this:
The resulting plot looks like this, which is exactly what I would expect, so all is well! However, the issue arises whenever I try to use a chunk size larger than one.
MWE: Raises TypeError
nsamp=20*10*10
shape=(20,10,10)
chunk_size=(20,5,5) # Chunking the data now
a = da.linspace(1,nsamp, nsamp).reshape(shape)
chunked = a.rechunk(chunk_size)
times = da.linspace(1, shape[0], shape[0])
results = chunked.map_blocks(f, chunks=(20,1,1), args=[times, 1], dtype='float') # error
Does anyone have any ideas as to what is happening here?
It looks like maybe your function expects single-dimensional inputs. I wonder if you can write a Python function that wraps your function and handles the unpacking and repacking of the one-dimensional inputs. If you can get that function to work on a single NumPy array of shape (20, 2, 2), for example, then you can probably use Dask to apply it across many similarly sized chunks.
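A minimal sketch of such a wrapper (the name fit_block and its arguments are made up here, and a first-degree fit is assumed): np.polyfit accepts a 2-D y with one series per column, so each chunk can be flattened to 2-D, fitted in a single call, and reshaped back.
import dask.array as da
import numpy as np

def fit_block(block, times, deg=1):
    # block has shape (nt, ny, nx); flatten the spatial axes so np.polyfit
    # sees a 2-D y (one time series per column), then restore the shape.
    nt, ny, nx = block.shape
    coeffs = np.polyfit(times, block.reshape(nt, ny * nx), deg)  # (deg + 1, ny * nx)
    return coeffs.reshape(deg + 1, ny, nx)

nsamp = 20 * 10 * 10
a = da.linspace(1, nsamp, nsamp).reshape(20, 10, 10).rechunk((20, 5, 5))
times = np.linspace(1, 20, 20)  # plain NumPy so every block sees the full time axis

results = a.map_blocks(
    fit_block, times, deg=1,
    chunks=(2, 5, 5), dtype=float,  # each (20, 5, 5) block maps to a (2, 5, 5) block
).compute()
print(results.shape)  # (2, 10, 10): slope and intercept for every pixel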
I'm using xarray with data for which I have measurements and errors.
I store these along a dimension moment in the dataset with coordinates value and variance.
When I compute, for example, the mean along a dimension, I need values and variances to be treated differently, as the former should be combined as
mean_values = sum(values)/len(values)
but the latter as
mean_variance = sum(variances**2)/len(variances).
Currently I'm doing this by forming two new datasets and concatenating them. This is very ugly and convoluted, and it is not suited to more complex calculations. I would like to be able to do this kind of operation in one step, perhaps by defining a function taking values and variances as input and then broadcasting the dataset dimension moment onto it.
Given a dataset q_lp with dimensions moment, time, position:
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
where average_of_squares is defined by
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis:
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
What better ways are there to handle this?
Is it possible to use xr.apply_ufunc and a custom my_average function to do this in one step and in place?
Should I not be putting these into one dataset together at all? q_lp is later on combined with other quantities, also with dimensions moment, pos and time, into a Dataset.
I'm grateful for discussion, ideas, tips and links to examples.
Edit:
To clarify, I don't like splitting the DataArray, handling each moment separately and concatenating them again. I would prefer a possibility to do something like the following (untested pseudocode for illustration):
def multi_moment_average(mean, variance):
    mean = np.average(mean)
    variance = np.sum(variance**2)/len(variance)
    return mean, variance

q_lp.reduce(multi_moment_average, broadcast='moment', dim='time')
Minimal working example:
import numpy as np
import xarray as xr
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis:
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()
q_lp = xr.DataArray(np.array([values, variance]),
                    coords=[['value', 'variance'], times, positions],
                    dims=['moment', 'time', 'position'])
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
I think you can write your function in an xarray-friendly way, and then call it on your data. i.e.
def average_of_squares(data, dim=None):
    sums = (data ** 2).sum(dim)
    return sums/data.count(dim)**2
q_lp_var = q_lp.sel(moment='variance').pipe(average_of_squares, dim='time')
Having them concatenated in the same DataArray is fine; they might be a more natural fit as items of a Dataset, though.
Does that answer your question?
Edit: re the edited question, I think holding the items in a Dataset rather than a DataArray is most coherent with the data structures. It seems like the mean & variance are two different arrays you want aligned on the same indexes, so a Dataset is ideal.
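For illustration, here is a minimal sketch of what that Dataset layout could look like, reusing the toy data from the question and the same reduction as average_of_squares above; each moment becomes its own variable and gets its own reduction in a single expression:
import numpy as np
import xarray as xr

times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()

ds = xr.Dataset(
    {
        "value": (("time", "position"), values),
        "variance": (("time", "position"), variance),
    },
    coords={"time": times, "position": positions},
)

# The same reductions as above, without splitting and re-concatenating.
reduced = xr.Dataset(
    {
        "value": ds["value"].mean("time"),
        # mirrors average_of_squares: sum of squares divided by count**2
        "variance": (ds["variance"] ** 2).sum("time") / ds["variance"].count("time") ** 2,
    }
)
print(reduced)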
I found a solution that suits my needs, but am still grateful for more suggestions:
groupby can separate a Dataset or DataArray along a specified dimension; calling list on the result creates (key, value) tuples, and passing those to dict gives essentially the form of a keyword dictionary. See http://xarray.pydata.org/en/stable/groupby.html
My current solution thus looks like this:
import xarray as xr
def function_applier(data, function, split_dimension=None, **function_kwargs):
    return xr.concat(
        function(
            **dict(list(data.groupby(split_dimension))),
            **function_kwargs),
        dim=split_dimension)
Now I can define functions that take specific coordinates as inputs and that can be written to also work for e.g. numpy arrays.
(MWE using the specific example of my original question here)
import numpy as np

def average_of_gaussians(val, var, dim=None):
    return val.mean(dim), (var ** 2).sum(dim)/var.count(dim)

val = np.random.rand(12).reshape(2, 6)
var = 0.1*np.random.rand(12).reshape(2, 6)
da = xr.DataArray([val, var],
                  dims=['moment', 'position', 'time'],
                  coords=[['val', 'var'],
                          ['a', 'b'],
                          np.arange(6)])
>>> da
<xarray.DataArray (moment: 2, position: 2, time: 6)>
array([[[0.66233728, 0.71419351, 0.96758741, 0.96949021, 0.94594299,
0.05080628],
[0.44005458, 0.64616657, 0.69865189, 0.84970553, 0.19561433,
0.8529829 ]],
[[0.02209967, 0.02152369, 0.09181031, 0.00223527, 0.01448938,
0.01484197],
[0.05651841, 0.04942305, 0.08250529, 0.04258035, 0.00184209,
0.0957248 ]]])
Coordinates:
* moment (moment) <U3 'val' 'var'
* position (position) <U1 'a' 'b'
* time (time) int32 0 1 2 3 4 5
>>> function_applier(da,
...                  average_of_gaussians,
...                  split_dimension='moment',
...                  dim='time')
<xarray.DataArray (moment: 2, position: 2)>
array([[0.71839295, 0.61386263],
[0.001636 , 0.00390397]])
Coordinates:
* position (position) <U1 'a' 'b'
* moment (moment) object 'val' 'var'
Note that the input argument names of average_of_gaussians equal the coordinate values of the moment dimension. Applying a different operation to each variable within one function, with no references to xarray inside it, are the properties I am after.
I have a question about resampling a 2-d array.
Sometimes, geoscience data has to be transformed from its original size to another size. If the ratio is the same for each axis, the task is simple: np.reshape (followed by a block sum) turns a 2-d array of 100x100 into 50x50 without data loss. The code is shown below:
import numpy as np

## create the original data
xc1, xc2, yc1, yc2 = 100, 110, 35, 45
XSIZE, YSIZE = 100, 100
lon, lat = np.linspace(xc1, xc2, XSIZE), np.linspace(yc1, yc2, YSIZE)
pop = np.random.uniform(low=1000, high=50000, size=(XSIZE*YSIZE,)).reshape(YSIZE, XSIZE)

## reshape and sum over blocks
shape = np.array(pop.shape, dtype=float)
coarseness = 2  # the new shape is 50 x 50
new_shape = coarseness * np.ceil(shape/coarseness).astype(int)
zp_pop = np.zeros(new_shape)
zp_pop[:int(shape[0]), :int(shape[1])] = pop
temp = zp_pop.reshape((new_shape[0] // coarseness, coarseness,
                       new_shape[1] // coarseness, coarseness))
coarse_pop = np.sum(temp, axis=(1, 3))

print(pop.sum())
print(coarse_pop.sum())
However, when the coarsening factor is different for each axis, this method cannot be applied, so I turned to other methods. Here is an example where I tried to use an FFT to generate a 60x80 array as output:
from scipy import fftpack

pop_fft = fftpack.fft2(pop, shape=(60, 80))
pop_res = fftpack.ifft2(pop_fft).real
print(pop.sum())
print(pop_res.sum())
254208134.8356425
122048754.13639387
The data loss was significant, which is why I am posting my issue here. Maybe the resampling function I used was not correct, or maybe there is a better approach for this situation. Any advice or comments are highly appreciated!
When you set up the 'coarse array' yourself, you sum over adjacent entries instead of computing the average or interpolating.
This way the sum over all elements in the coarse and original arrays is identical: str((coarse_pop.sum()-pop.sum())/(0.5*(pop.sum()+coarse_pop.sum()))) gives '-1.1638426077573779e-16', only a tiny numerical error.
If you instead compare the mean of the fftpack-resampled coarse array, it matches up:
print(pop.mean())
print(pop_res.mean())
25606.832220313503
25496.03271480075
Alternatively, you can correct for the number of elements yourself:
print(pop.sum())
print(pop_res.sum()*100*100/(60*80))
256068322.20313504
254960327.14800745
I don't know the details of your problem, but the fftpack way of downsampling the array makes more sense to me. If it's not what you want, you can apply the prefactor to the original array instead, like pop_fft = fftpack.fft2(pop*100*100/(60*80), shape=(60, 80)).
I want to compute the parameters of a statistical distribution fitted over the time dimension of an xarray.DataArray.
I'd like to create a function that does something like:
from scipy import stats
import xarray as xr
def fit(arr):
return xr.apply_ufunc(stats.norm.fit, arr, ...)
that returns a new DataArray storing the two parameters of the distribution computed over the time dimension. So if an input has dimensions (time, lat, lon), fit would return a DataArray with dimensions (params, lat, lon). The next step would be to use these parameters to compute various percentiles (e.g. stats.norm.ppf).
After many unsuccessful trials, I'm starting to doubt that apply_ufunc supports this use case, and to think that I should instead do the computation using
params = np.apply_along_axis(stats.norm.fit, arr.get_axis_num('time'), arr.data)
then create the DataArray manually, copying dimensions and attributes.
Thoughts? Suggestions?
Here is what I ended up doing, which feels a bit like a hack:
# dc is the scipy.stats distribution being fitted (e.g. stats.norm)
# Fit the parameters (lazy computation)
data = dask.array.apply_along_axis(dc.fit, arr.get_axis_num('time'), arr)

# Create a DataArray with the desired dimensions to copy them over to the parameter array.
mean = arr.mean(dim='time', keep_attrs=True)
coords = dict(mean.coords.items())
coords['dparams'] = ([] if dc.shapes is None else dc.shapes.split(',')) + ['loc', 'scale']

out = xr.DataArray(data=data, coords=coords, dims=(u'dparams',) + mean.dims)
out.attrs = arr.attrs
Dask array includes an analogue of apply_along_axis, which may be the most obvious place to start. Note that each variable of an xarray object that has chunks set automatically encapsulates a dask array in its .data attribute. You may even be able to pass the xarray variable directly.
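For illustration, a minimal sketch of that route (the toy dimensions here are assumptions; stats.norm is the distribution from the question): dask.array.apply_along_axis works on the chunked .data, and the labelled array is rebuilt afterwards.
import dask.array as da
import numpy as np
import xarray as xr
from scipy import stats

# Toy chunked DataArray with dims (time, lat, lon).
arr = xr.DataArray(
    np.random.rand(200, 4, 5),
    dims=("time", "lat", "lon"),
    coords={"lat": np.arange(4), "lon": np.arange(5)},
).chunk({"lat": 2})

# Lazily fit loc and scale along the time axis of the underlying dask array.
params = da.apply_along_axis(
    stats.norm.fit,
    arr.get_axis_num("time"),
    arr.data,
    dtype=float,
    shape=(2,),  # norm.fit returns (loc, scale)
)

# The time axis is replaced by the parameter axis; rebuild a labelled array.
out = xr.DataArray(
    params,
    dims=("dparams", "lat", "lon"),
    coords={"dparams": ["loc", "scale"], "lat": arr.lat, "lon": arr.lon},
)
print(out.compute())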