xarray equivalent of pandas `qcut()` function - python

I want to calculate the Decile Index (see the notebook ex1-Calculate Decile Index (DI) with Python.ipynb).
The pandas implementation is simple enough, but I need help applying the bin labels to a new variable / coordinate using the groupby_bins() functionality.
working example (test dataset)
import pandas as pd
import numpy as np
import xarray as xr
time = pd.date_range('2010-01-01','2011-12-31',freq='M')
lat = np.linspace(-5.175003, -4.7250023, 10)
lon = np.linspace(33.524994, 33.97499, 10)
precip = np.random.normal(0, 1, size=(len(time), len(lat), len(lon)))
ds = xr.Dataset(
    {'precip': (['time', 'lat', 'lon'], precip)},
    coords={
        'lon': lon,
        'lat': lat,
        'time': time,
    }
)
This looks like:
Out[]:
<xarray.Dataset>
Dimensions: (lat: 10, lon: 10, time: 24)
Coordinates:
* lon (lon) float64 33.52 33.57 33.62 33.67 ... 33.82 33.87 33.92 33.97
* lat (lat) float64 -5.175 -5.125 -5.075 -5.025 ... -4.825 -4.775 -4.725
* time (time) datetime64[ns] 2010-01-31 2010-02-28 ... 2011-12-31
Data variables:
precip (time, lat, lon) float64 0.1638 -1.031 0.2087 ... -0.1147 -0.6863
Calculating the cumulative frequency distribution (normalised rank)
# calculate a cumsum over some window size
rolling_window = 3
ds_window = (
    ds.rolling(time=rolling_window, center=True)
    .sum()
    .dropna(dim='time', how='all')
)

# construct a cumulative frequency distribution ranking the precip values
# per month
def rank_norm(ds, dim='time'):
    return (ds.rank(dim=dim) - 1) / (ds.sizes[dim] - 1) * 100

result = ds_window.groupby('time.month').apply(rank_norm, args=('time',))
result = result.rename({'precip': 'rank_norm'}).drop('month')
Out[]:
<xarray.Dataset>
Dimensions: (lat: 10, lon: 10, time: 108)
Coordinates:
* lat (lat) float64 -5.175 -5.125 -5.075 ... -4.825 -4.775 -4.725
* lon (lon) float64 33.52 33.57 33.62 33.67 ... 33.82 33.87 33.92 33.97
* time (time) datetime64[ns] 2010-01-31 2010-02-28 ... 2018-12-31
Data variables:
rank_norm (time, lat, lon) float64 75.0 75.0 12.5 100.0 ... 87.5 0.0 25.0
Pandas Solution
I want to create a new variable or coordinate in ds that holds the integers corresponding to the bins bins = [20., 40., 60., 80., np.Inf].
Doing this in pandas is relatively simple with the qcut function.
test = result.to_dataframe()
bins = pd.qcut(test['rank_norm'], 5, labels=[1, 2, 3, 4, 5])
result = bins.to_xarray().to_dataset().rename({'rank_norm': 'rank_bins'})
Out[]:
<xarray.Dataset>
Dimensions: (lat: 10, lon: 10, time: 108)
Coordinates:
* lat (lat) float64 -5.175 -5.125 -5.075 -5.025 ... -4.825 -4.775 -4.725
* lon (lon) float64 33.52 33.57 33.62 33.67 ... 33.82 33.87 33.92 33.97
* time (time) datetime64[ns] 2010-01-31 2010-02-28 ... 2018-12-31
Data variables:
rank_bins (lat, lon, time) int64 4 4 1 4 3 4 5 1 1 2 ... 2 1 1 4 2 4 3 1 2 2
My xarray attempt
# assign bins to variable xarray
bins = [20., 40., 60., 80., np.Inf]
decile_index_gpby = rank_norm.groupby_bins('rank_norm', bins=bins)
out = decile_index_gpby.assign() # assign_coords()
The error message I get is as follows:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-166-8d48b9fc1d56> in <module>
1 bins = [20., 40., 60., 80., np.Inf]
2 decile_index_gpby = rank_norm.groupby_bins('rank_norm', bins=bins)
----> 3 out = decile_index_gpby.assign() # assign_coords()
~/miniconda3/lib/python3.7/site-packages/xarray/core/groupby.py in assign(self, **kwargs)
772 Dataset.assign
773 """
--> 774 return self.apply(lambda ds: ds.assign(**kwargs))
775
776
~/miniconda3/lib/python3.7/site-packages/xarray/core/groupby.py in apply(self, func, args, **kwargs)
684 kwargs.pop('shortcut', None) # ignore shortcut if set (for now)
685 applied = (func(ds, *args, **kwargs) for ds in self._iter_grouped())
--> 686 return self._combine(applied)
687
688 def _combine(self, applied):
~/miniconda3/lib/python3.7/site-packages/xarray/core/groupby.py in _combine(self, applied)
691 coord, dim, positions = self._infer_concat_args(applied_example)
692 combined = concat(applied, dim)
--> 693 combined = _maybe_reorder(combined, dim, positions)
694 if coord is not None:
695 combined[coord.name] = coord
~/miniconda3/lib/python3.7/site-packages/xarray/core/groupby.py in _maybe_reorder(xarray_obj, dim, positions)
468
469 def _maybe_reorder(xarray_obj, dim, positions):
--> 470 order = _inverse_permutation_indices(positions)
471
472 if order is None:
~/miniconda3/lib/python3.7/site-packages/xarray/core/groupby.py in _inverse_permutation_indices(positions)
110 positions = [np.arange(sl.start, sl.stop, sl.step) for sl in positions]
111
--> 112 indices = nputils.inverse_permutation(np.concatenate(positions))
113 return indices
114
~/miniconda3/lib/python3.7/site-packages/xarray/core/nputils.py in inverse_permutation(indices)
58 # use intp instead of int64 because of windows :(
59 inverse_permutation = np.empty(len(indices), dtype=np.intp)
---> 60 inverse_permutation[indices] = np.arange(len(indices), dtype=np.intp)
61 return inverse_permutation
62
IndexError: index 1304 is out of bounds for axis 0 with size 1000

I'm not sure pandas.qcut is giving you exactly what you expect; e.g. see the bins it returns in your example:
>>> test = result.to_dataframe()
>>> binned, bins = pd.qcut(test['rank_norm'], 5, labels=[1, 2, 3, 4, 5], retbins=True)
>>> bins
array([ 0. , 12.5, 37.5, 62.5, 87.5, 100. ])
If I understand correctly, you are looking to assign an integer value at each point based on the bin the point falls into. That is:
0.0 <= x < 20.0: 1
20.0 <= x < 40.0: 2
40.0 <= x < 60.0: 3
60.0 <= x < 80.0: 4
80.0 <= x: 5
For this task I would probably recommend using numpy.digitize applied via xarray.apply_ufunc:
>>> bins = [0., 20., 40., 60., 80., np.inf]
>>> result = xr.apply_ufunc(np.digitize, result, kwargs={'bins': bins})
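As a quick, self-contained check that np.digitize reproduces the mapping listed above (the toy DataArray here is purely illustrative):
>>> import numpy as np
>>> import xarray as xr
>>> bins = [0., 20., 40., 60., 80., np.inf]
>>> toy = xr.DataArray([5., 25., 55., 95.], dims='x', name='rank_norm')
>>> xr.apply_ufunc(np.digitize, toy, kwargs={'bins': bins}).values
array([1, 2, 3, 5])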

It looks like if you define your bins with an explicit list of edges, it will only generate 4 ranges (one fewer than the number of edges). You can check this by looking at the length and the names of the keys of the groups of the resulting GroupBy object:
mybins = [20., 40., 60., 80., np.inf]
decile_index_gpby = rank_norm.groupby_bins('rank_norm', bins=mybins)
len(decile_index_gpby.groups)
=> 4
decile_index_gpby.groups.keys()
=> [Interval(80.0, inf, closed='right'),
Interval(20.0, 40.0, closed='right'),
Interval(60.0, 80.0, closed='right'),
Interval(40.0, 60.0, closed='right')]
To prevent the loss of 1/5th of the values, you would have to change your definition of mybins to something like:
mybins = [np.NINF, 20., 40., 60., np.inf]
which is not what you want.
So use bins=5 instead:
decile_index_gpby = rank_norm.groupby_bins('rank_norm', bins=5)
len(decile_index_gpby.groups)
=> 5
decile_index_gpby.groups.keys()
=> [Interval(80.0, 100.0, closed='right'),
Interval(20.0, 40.0, closed='right'),
Interval(60.0, 80.0, closed='right'),
Interval(40.0, 60.0, closed='right'),
Interval(-0.1, 20.0, closed='right')]
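If you ultimately want integer group labels rather than pandas Interval objects, groupby_bins also accepts a labels argument (forwarded to pandas.cut); a minimal sketch, reusing the same object as above:
decile_index_gpby = rank_norm.groupby_bins('rank_norm', bins=5, labels=[1, 2, 3, 4, 5])
decile_index_gpby.groups.keys()
=> [5, 2, 4, 3, 1]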

Related

OSError when extracting values from xarray DataArray

I have a dataset containing windspeeds at multiple pressure levels for 3 consecutive months:
import xarray as xr
da = xr.open_dataset('autumn_data.grib', engine='cfgrib')
In[1]: da
Out[1]:
<xarray.Dataset>
Dimensions: (time: 2208, isobaricInhPa: 11, latitude: 161, longitude: 401)
Coordinates:
number int32 ...
* time (time) datetime64[ns] 2020-08-01 ... 2020-10-31T23:00:00
step timedelta64[ns] ...
* isobaricInhPa (isobaricInhPa) float64 1e+03 950.0 900.0 ... 550.0 500.0
* latitude (latitude) float64 70.0 69.75 69.5 69.25 ... 30.5 30.25 30.0
* longitude (longitude) float64 -90.0 -89.75 -89.5 ... 9.5 9.75 10.0
valid_time (time) datetime64[ns] ...
Data variables:
u (time, isobaricInhPa, latitude, longitude) float32 ...
v (time, isobaricInhPa, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-06-19T13:42 GRIB to CDM+CF via cfgrib-0.9.1...
I loop over this dataset to make a numpy array with windspeeds in both directions for 40 consecutive hours.
import numpy as np
RUNTIME = 40
TIMES = da.time.values
for t in range(len(TIMES) - RUNTIME + 1):
    WIND = np.stack((
        da['u'].sel(time=xr.DataArray(TIMES[t:t+RUNTIME])).values,
        da['v'].sel(time=xr.DataArray(TIMES[t:t+RUNTIME])).values
    ))
This worked fine, until some point where I got an error.
In[2]: da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME])).values
Traceback (most recent call last):
Input In [2] in <cell line: 1>
da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME])).values
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\dataarray.py:646 in values
return self.variable.values
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\variable.py:519 in values
return _as_array_or_item(self._data)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\variable.py:259 in _as_array_or_item
data = np.asarray(data)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:551 in __array__
self._ensure_cached()
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:548 in _ensure_cached
self.array = NumpyIndexingAdapter(np.asarray(self.array))
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:521 in __array__
return np.asarray(self.array, dtype=dtype)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:422 in __array__
return np.asarray(array[self.key], dtype=None)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\xarray_plugin.py:144 in __getitem__
return xr.core.indexing.explicit_indexing_adapter(
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:711 in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\xarray_plugin.py:150 in _getitem
return self.array[key]
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\dataset.py:342 in __getitem__
message = self.index.get_field(message_ids[0]) # type: ignore
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:472 in get_field
return ComputedKeysAdapter(self.fieldset[message_id], self.computed_keys)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:332 in __getitem__
return self.message_from_file(file, offset=item)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:328 in message_from_file
return Message.from_file(file, offset, **kwargs)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:91 in from_file
file.seek(offset)
OSError: [Errno 22] Invalid argument
This seems very strange, because I didn't get this error in the previous loops and because I can create the corresponding DataArray.
In[3]: da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME]))
Out[3]:
<xarray.DataArray 'u' (dim_0: 40, isobaricInhPa: 11, latitude: 161, longitude: 401)>
[28406840 values with dtype=float32]
Coordinates:
number int32 0
time (dim_0) datetime64[ns] 2020-08-30T20:00:00 ... 2020-09-01T...
step timedelta64[ns] 00:00:00
* isobaricInhPa (isobaricInhPa) float64 1e+03 950.0 900.0 ... 550.0 500.0
* latitude (latitude) float64 70.0 69.75 69.5 69.25 ... 30.5 30.25 30.0
* longitude (longitude) float64 -90.0 -89.75 -89.5 ... 9.5 9.75 10.0
valid_time (dim_0) datetime64[ns] 2020-08-30T20:00:00 ... 2020-09-01T...
Dimensions without coordinates: dim_0
Attributes:
GRIB_paramId: 131
GRIB_dataType: an
GRIB_numberOfPoints: 64561
GRIB_typeOfLevel: isobaricInhPa
GRIB_stepUnits: 1
GRIB_stepType: instant
GRIB_gridType: regular_ll
GRIB_NV: 0
GRIB_Nx: 401
GRIB_Ny: 161
GRIB_cfName: eastward_wind
GRIB_cfVarName: u
GRIB_gridDefinitionDescription: Latitude/Longitude Grid
GRIB_iDirectionIncrementInDegrees: 0.25
GRIB_iScansNegatively: 0
GRIB_jDirectionIncrementInDegrees: 0.25
GRIB_jPointsAreConsecutive: 0
GRIB_jScansPositively: 0
GRIB_latitudeOfFirstGridPointInDegrees: 70.0
GRIB_latitudeOfLastGridPointInDegrees: 30.0
GRIB_longitudeOfFirstGridPointInDegrees: -90.0
GRIB_longitudeOfLastGridPointInDegrees: 10.0
GRIB_missingValue: 9999
GRIB_name: U component of wind
GRIB_shortName: u
GRIB_totalNumber: 0
GRIB_units: m s**-1
long_name: U component of wind
units: m s**-1
standard_name: eastward_wind
Why does this occur? And how can I fix this?

Python Change Dimension and Coordinates Xarray Dataset

I have an xarray Dataset that looks like the one below. I need to be able to plot, by latitude and longitude, any of the three data variables (si10, si10_u, avg). However, I cannot figure out how to change the dimensions from index_id to latitude and longitude, or how to delete index_id from the coordinates; when I try that, 'latitude' and 'longitude' disappear from the coordinates as well. Thank you for suggestions.
Here is my xarray Dataset:
<xarray.Dataset>
Dimensions: (index: 2448, index_id: 2448)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) float64 58.0 58.0 58.0 58.0 ... 23.0 23.0 23.0 23.0
- longitude (index_id) float64 -130.0 -129.0 -128.0 ... -65.0 -64.0 -63.0
Dimensions without coordinates: index
Data variables:
si10 (index) float32 1.7636629 1.899161 ... 5.9699616 5.9121003
si10_u (index) float32 1.6784391 1.7533684 ... 6.13361 6.139127
avg (index) float32 1.721051 1.8262646 ... 6.0517855 6.025614
You have two issues. First, you need to replace 'index' with 'index_id' so your data is indexed consistently. Second, to unstack 'index_id', you're looking for xr.Dataset.unstack:
ds = ds.unstack('index_id')
As an example... here's a dataset like yours
In [16]: y = np.arange(58, 23, -1)
...: x = np.arange(-130, -63, 1)
In [17]: ds = xr.Dataset(
...: data_vars={
...: v: (("index",), np.random.random(len(x) * len(y)))
...: for v in ["si10", "si10_u", "avg"]
...: },
...: coords={
...: "index_id": pd.MultiIndex.from_product(
...: [y, x], names=["latitude", "longitude"],
...: ),
...: },
...: )
In [18]: ds
Out[18]:
<xarray.Dataset>
Dimensions: (index: 2345, index_id: 2345)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) int64 58 58 58 58 58 58 58 58 ... 24 24 24 24 24 24 24
- longitude (index_id) int64 -130 -129 -128 -127 -126 ... -68 -67 -66 -65 -64
Dimensions without coordinates: index
Data variables:
si10 (index) float64 0.9412 0.7395 0.6843 ... 0.03979 0.4259 0.09203
si10_u (index) float64 0.7359 0.1984 0.5919 ... 0.5535 0.2867 0.4093
avg (index) float64 0.04257 0.1442 0.008705 ... 0.1911 0.2669 0.1498
First, reorganize your data to have consistent dims:
In [19]: index_id = ds['index_id']
In [20]: ds = (
...: ds.drop("index_id")
...: .rename({"index": "index_id"})
...: .assign_coords(index_id=index_id)
...: )
Then, ds.unstack reorganizes the data to be the combinatorial product of all dimensions in the MultiIndex:
In [21]: ds.unstack("index_id")
Out[21]:
<xarray.Dataset>
Dimensions: (latitude: 35, longitude: 67)
Coordinates:
* latitude (latitude) int64 24 25 26 27 28 29 30 31 ... 52 53 54 55 56 57 58
* longitude (longitude) int64 -130 -129 -128 -127 -126 ... -67 -66 -65 -64
Data variables:
si10 (latitude, longitude) float64 0.9855 0.1467 ... 0.6569 0.9479
si10_u (latitude, longitude) float64 0.4672 0.2664 ... 0.4894 0.128
avg (latitude, longitude) float64 0.3738 0.01793 ... 0.1264 0.21

xarray set new 2D coordinate as dimension

I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is two 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: how can I change the dimensions of data to be (lon: (10, 10), lat: (10, 10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr

# Create sample data
data = np.random.rand(10, 10)
x = y = np.arange(10)

# Set up dataset
ds = xr.Dataset(
    data_vars=dict(
        data=(["x", "y"], data)
    ),
    coords={
        "x": x,
        "y": y
    }
)

# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
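One common workaround (sketched below on the example dataset just constructed; note that it masks and trims rather than label-slices, so it is not a full substitute for .sel) is to filter on the 2D coordinates with .where(..., drop=True):
# keep only grid points whose 2D lat/lon fall inside the box;
# everything outside becomes NaN and all-NaN rows/columns are dropped
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)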

Efficient way to stack Dask Arrays generated from Xarray

I am trying to read a large number of relatively large netCDF files containing hydrologic data. The netCDF files all look like this:
<xarray.Dataset>
Dimensions: (feature_id: 2729077, reference_time: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 1993-01-11T21:00:00
* reference_time (reference_time) datetime64[ns] 1993-01-01
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 ...
Data variables:
streamflow (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
q_lateral (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
velocity (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
qSfcLatRunoff (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
qBucket (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
qBtmVertRunoff (feature_id) float64 dask.array<shape=(2729077,), chunksize=(50000,)>
Attributes:
featureType: timeSeries
proj4: +proj=longlat +datum=NAD83 +no_defs
model_initialization_time: 1993-01-01_00:00:00
station_dimension: feature_id
model_output_valid_time: 1993-01-11_21:00:00
stream_order_output: 1
cdm_datatype: Station
esri_pe_string: GEOGCS[GCS_North_American_1983,DATUM[D_North_...
Conventions: CF-1.6
model_version: NWM 1.2
dev_OVRTSWCRT: 1
dev_NOAH_TIMESTEP: 3600
dev_channel_only: 0
dev_channelBucket_only: 0
dev: dev_ prefix indicates development/internal me...
I have 25 years' worth of this data, recorded hourly, so there is about 4 TB of data in total.
Right now I am just trying to get seasonal averages (daily and monthly) of the streamflow values, so I created the following script.
import xarray as xr
import dask.array as da
from dask.distributed import Client
import os

workdir = '/path/to/directory/of/files'
files = [os.path.join(workdir, i) for i in os.listdir(workdir)]

client = Client(processes=False, threads_per_worker=4, n_workers=4, memory_limit='750MB')

big_array = []
for i, file in enumerate(files):
    ds = xr.open_dataset(file, chunks={"feature_id": 50000})
    if i == 0:
        print(ds)
        print(ds.streamflow)
    big_array.append(ds.streamflow)
    ds.close()
    if i == 5:
        break

dask_big_array = da.stack(big_array, axis=0)
print(dask_big_array)
The ds.streamflow object looks like this when printed, and from what I understand it is just a Dask array:
<xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000]
The weird thing is that when I stack the arrays, they seem to lose the chunking that I applied to them earlier. When I print out dask_big_array I get this:
dask.array<stack, shape=(6, 2729077), dtype=float64, chunksize=(1, 2729077)>
The problem I am running into is that when I try to run this code I get the following warning, and then I think the memory gets overloaded, so I have to kill the process.
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk...
So I guess I have a few questions:
Why is the dask array losing the chunking when stacked?
Is there a more efficient way to stack all of these arrays to parallelize this process?
From the comments, this is what big array is:
[<xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000], <xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000], <xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000], <xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000], <xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000], <xarray.DataArray 'streamflow' (feature_id: 2729077)>
dask.array<shape=(2729077,), dtype=float64, chunksize=(50000,)>
Coordinates:
* feature_id (feature_id) int32 101 179 181 183 185 843 845 847 849 851 ...
Attributes:
long_name: River Flow
units: m3 s-1
coordinates: latitude longitude
valid_range: [ 0 50000000]]
The problem here is that dask.array.stack() doesn't recognize xarray.DataArray objects as holding dask arrays, so it converts them all to NumPy arrays instead. This is how you end up exhausting your memory.
You could fix this in several ways:
Call dask.array.stack() on a list of dask arrays, e.g., switch big_array.append(ds.streamflow) to big_array.append(ds.streamflow.data).
Use xarray.concat() instead of dask.array.stack(), e.g., writing dask_big_array = xarray.concat(big_array, dim='time').
Use xarray.open_mfdataset() which combines the process of opening many files and stacking them together, e.g., replacing all of your logic here with xarray.open_mfdataset('/path/to/directory/of/files/*').
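As a rough sketch of the third option combined with the monthly averaging you are after (the glob pattern and chunk size are illustrative, and the exact open_mfdataset/resample arguments depend on your xarray version):
import xarray as xr
# lazily open every file as one dataset, concatenated along time,
# keeping dask chunks along feature_id
ds = xr.open_mfdataset('/path/to/directory/of/files/*', chunks={'feature_id': 50000})
# monthly-mean streamflow; nothing is loaded until .compute() (or a plot/save) forces it
monthly_mean = ds['streamflow'].resample(time='1M').mean()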

Slider labels do not correspond to the title

I have this xarray dataset defined as ds:
<xarray.Dataset>
Dimensions: (bnds: 2, lag: 61, plev: 63)
Coordinates:
* plev (plev) float64 1e+03 925.0 850.0 800.0 780.0 750.0 700.0 ...
* lag (lag) int64 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19 ...
* bnds (bnds) int64 0 1
Data variables:
time_bnds (lag, plev, bnds) float64 -3.468e+04 -3.468e+04 -3.468e+04 ...
lat_bnds (lag, plev, bnds) float64 -48.24 -51.95 -48.24 -51.95 -48.24 ...
lon_bnds (lag, plev, bnds) float64 -318.8 -322.5 -318.7 -322.5 -318.7 ...
plev_bnds (lag, plev, bnds) float64 -1e+05 -9.25e+04 -9.25e+04 -8.5e+04 ...
accelogw (lag, plev) float64 -0.001869 0.05221 0.04774 0.02534 0.02233 ...
where the plev coordinate (pressure level) is decreasing from 1000 to 0.0007 hPa.
I define geoviews dataset like this:
import geoviews as gv
import holoviews as hv
kdims = ['plev', 'lag']
vdims = ['accelogw']
dataset = gv.Dataset(ds, kdims=kdims, vdims=vdims)
I create the holoviews Curve object in this way:
%%opts Curve [xrotation=25] NdOverlay [fig_size=300 aspect=1.2]
dataset.to(hv.Curve, 'lag')
resulting in frames beginning with this one:
As you can see, the slider label shows a pressure level of 0.0007 hPa, in contrast to the title, which shows 1000 hPa. Is this a bug or the default behavior of holoviews/geoviews for dimensions?
Thanks for your time.
EDIT: I have holoviews on v1.6.2, geoviews on v1.1.0 and xarray on v0.8.2.
